Home Credit Default Risk -Kaggle Project - AML 526 Group 3 - Phase2

  • Jugal Shah
  • Deepak Deopura
  • Andrew Doto
  • Gautham Nagendra

Abstract

Home Credit Group published a dataset on Kaggle website with objective of encouraging aspiring data scientists and kagglers to solve a problem of wrongfully loan rejection. As part of the AML course we have selected this project to practice and demonstrate our Machine learning skills.

Phase 2 of the HCDR project contained several main elements: Feature engineering, creating a preprocessing pipeline, and hyperparameter tuning. Our group utilized the supplementary files as well as the main dataset to engineer new features. The group split up the tasks to tackle the supplemental files separately. We then created a preprocessing pipeline to prepare the dataset for final hyperparameter tuning, which included adding new features, cleaning/standardizing/imputing, and dimensionality reduction. Once the final dataset was prepared, we experimented with hyperparameter tuning and balancing methods. We tuned hyperparameters for XGBoost and LightGBM models, as well as tested manual and SMOTE balancing techniques. After experimentation, our best model was a LightGBM model using SMOTE balancing, which secured a Kaggle public score of 0.794 and a private score of 0.789.

Data Description

The seven different source that would be used to further analyze are: (Credit)

  • application_{train|test}.csv
  • bureau.csv
  • bureau_balance.csv
  • POS_CASH_balance.csv
  • credit_card_balance.csv
  • previous_application.csv
  • installments_payments.csv
  • HomeCredit_columns_description.csv

Phase Description and Workflow

Phase 2 work was divided into three parts. All parts can be found in this notebook.

PART 1 : EDA AND DEVELOPING FEATURES USING SUPPLEMETARY FILES

At the end of Part1 we exported all supllementary processed files that can directly merged with main application train or test dataset.

PART 2 : BUILDING PRE-PROCESSING PIPELINE

At the end of pre-processing, data is exported that can directly use by machine learning models.

PART 3: DEVELOPING MACHINE LEARNING MODEL

This part involves hyper parameter tuning as well as performing analysis using unbalanced dataset, balanced dataset (manual as well as auto), scaled data set etc.

PART 1 - EDA and Feature Engineering

Common functions

Common functions are created so that we can utilize those for give supplemental files.

  • Read file function take list of csv files to read along with path and return list of data frames, one data frame corresponding to one file
  • Missing value check – Used to check null value count of each column in data frame based on given percent threshold for null value. For example, if threshold is 50 then this function will return all columns which have more than 50% missing values along with total missing count and bar plot.
  • OHE – This is custom one hot encoding function so that we can do feature engineering based on values in categorical column
  • Reduce memory usages – There are some files which are huge size so this function take dataframe and reduce memory usages of that dataframe.

Read files

In [3]:
# read list of input files from given path
#path = "/content/drive/My Drive/I526 Final Project/data/"
path = "../Project/data/raw/"
def read_files(list_of_files, path, print_details = True):
    df_list = []
    for file in list_of_files:
        chunksize = 500000
        filename = path + file
        
        i = 1
        for chunk in tqdm(pd.read_csv(filename, chunksize=chunksize, low_memory=False)):
            df = chunk if i == 1 else pd.concat([df, chunk])
            if print_details:
                print('-->Read Chunk...', i)
            i += 1
        df_list.append(df)
        print(file + " ... Read Completed")
        print('*'*40)
    return df_list

Missing value check

In [4]:
# retun list of columns from datafram which have missing values with given threshhold
def missingdata(data,missing_threshhold):
    total = data.isnull().sum().sort_values(ascending = False)
    percent = (data.isnull().sum()/data.isnull().count()*100).sort_values(ascending = False)
    ms=pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
    ms= ms[ms["Percent"] > missing_threshhold]
    print('\n\nNumber of columns with over', missing_threshhold ,'percent missing values \n-------------------')
    print(len(ms))
    f,ax =plt.subplots(figsize=(15,10))
    plt.xticks(rotation='90')
    fig=sns.barplot(ms.index, ms["Percent"],color="red",alpha=0.8)
    plt.xlabel('Column Name', fontsize=15)
    plt.ylabel('Percent of missing values', fontsize=15)
    plt.title('Percent missing data by Columns', fontsize=15)
    return ms

OHE

In [5]:
def one_hot_encoder(df, nan_as_category = True, columns = None):
    original_columns = list(df.columns)
    if columns == None:
        categorical_columns = [col for col in df.columns if df[col].dtype == 'object']
    else:
        categorical_columns = columns
    df = pd.get_dummies(df, columns= categorical_columns, dummy_na= nan_as_category)
    new_columns = [c for c in df.columns if c not in original_columns]
    return df, new_columns

Read files

In [6]:
print(os.listdir("../Project/data/raw/"))
['Final_Files', 'HomeCredit_columns_description.csv', 'HomeCredit_columns_description.xlsx', 'POS_CASH_balance.csv', 'application_test.csv', 'application_train.csv', 'bureau.csv', 'bureau_balance.csv', 'credit_card_balance.csv', 'installments_payments.csv', 'installments_payments_1.csv', 'installments_payments_2.csv', 'previous_application.csv', '~$HomeCredit_columns_description.xlsx']

Reduce memory usage

In [7]:
#reduce memory usages of given datafram
def reduce_memory(df):
    start_mem = df.memory_usage().sum() / 1024**2
    print('memory usage is ' , round(start_mem), 'MB')
    for col in df.columns:
        col_type = df[col].dtype

        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2
    print('end memory usage is ' , round(end_mem), 'MB')
    return df

Bureau data

Bureau data description

Bureau dataset has two files, bureau.csv and bureau_balance.csv.

Bureau data set has

  • All client's previous credits provided by other financial institutions that were reported to Credit Bureau for given applicant in application_train dataset.
  • There are as equal number of rows of credits the client had in Credit Bureau before the application date in application_train data.

Some of features in Bureau dataset are

  • SK_BUREAU_ID- Primary key
  • CREDIT_ACTIVE-Status of the Credit Bureau reported credits
  • CREDIT_DAY_OVERDUE- Number of days past due on CB credit at the time of application for related loan in applicaiton train data
  • AMT_CREDIT_SUM- Current credit amount for the Credit Bureau credit
  • AMT_CREDIT_SUM_DEBT-Current debt on Credit Bureau credit
  • AMT_CREDIT_SUM_LIMIT-Current credit limit of credit card reported in Credit Bureau
  • AMT_CREDIT_SUM_OVERDUE-Current amount overdue on Credit Bureau credit
  • CREDIT_TYPE- type of Credit Bureau credit (Car, cash,...)

Bureau balance data set has

  • Monthly balances of previous credits in Credit Bureau.
  • This dataset has one row for each month of history of every previous credit reported to Credit Bureau. This is related to Bureau on SK_ID_BUREAU key

Some of features in Bureau dataset are

  • MONTHS_BALANCE- Month of balance relative to application date
  • STATUS- Status of Credit Bureau loan during the month (active, closed..)

Bureau data EDA

In [8]:
app_train,bureau, bureau_balance = read_files(['application_train.csv','bureau.csv', 'bureau_balance.csv'], path, print_details=False)
1it [00:05,  5.24s/it]
0it [00:00, ?it/s]
application_train.csv ... Read Completed
****************************************
4it [00:03,  1.11it/s]
1it [00:00,  7.80it/s]
bureau.csv ... Read Completed
****************************************
55it [00:16,  3.26it/s]
bureau_balance.csv ... Read Completed
****************************************

In [9]:
bureau.shape
Out[9]:
(1716428, 17)
In [10]:
bureau_balance.shape
Out[10]:
(27299925, 3)
In [11]:
bureau.dtypes.value_counts()
Out[11]:
float64    8
int64      6
object     3
dtype: int64
In [12]:
bureau.select_dtypes('object').apply(pd.Series.nunique, axis = 0)
Out[12]:
CREDIT_ACTIVE       4
CREDIT_CURRENCY     4
CREDIT_TYPE        15
dtype: int64
In [13]:
bureau.select_dtypes('int').apply(pd.Series.nunique, axis = 0)
Out[13]:
SK_ID_CURR             305811
SK_ID_BUREAU          1716428
DAYS_CREDIT              2923
CREDIT_DAY_OVERDUE        942
CNT_CREDIT_PROLONG         10
DAYS_CREDIT_UPDATE       2982
dtype: int64
In [14]:
missingdata(bureau,1)

Number of columns with over 1 percent missing values 
-------------------
6
Out[14]:
Total Percent
AMT_ANNUITY 1226791 71.473490
AMT_CREDIT_MAX_OVERDUE 1124488 65.513264
DAYS_ENDDATE_FACT 633653 36.916958
AMT_CREDIT_SUM_LIMIT 591780 34.477415
AMT_CREDIT_SUM_DEBT 257669 15.011932
DAYS_CREDIT_ENDDATE 105553 6.149573
In [15]:
plt.figure(figsize=(15,5))
ax=sns.countplot(x='CREDIT_ACTIVE', data=bureau, order=bureau['CREDIT_ACTIVE'].value_counts(normalize=True).index);

plt.title('CREDIT_ACTIVE');
plt.xticks(rotation=90);
In [16]:
plt.figure(figsize=(15,5))
ax=sns.countplot(x='CREDIT_TYPE', data=bureau, order=bureau['CREDIT_TYPE'].value_counts(normalize=True).index);

plt.title('CREDIT_TYPE');
plt.xticks(rotation=90);
In [17]:
columns = bureau.columns[bureau.dtypes == 'object']

for c in columns:
    display((bureau[c].value_counts(normalize=True)*100).round(2),(bureau[c].value_counts().round(2)))
    print('*'*40)
Closed      62.88
Active      36.74
Sold         0.38
Bad debt     0.00
Name: CREDIT_ACTIVE, dtype: float64
Closed      1079273
Active       630607
Sold           6527
Bad debt         21
Name: CREDIT_ACTIVE, dtype: int64
****************************************
currency 1    99.92
currency 2     0.07
currency 3     0.01
currency 4     0.00
Name: CREDIT_CURRENCY, dtype: float64
currency 1    1715020
currency 2       1224
currency 3        174
currency 4         10
Name: CREDIT_CURRENCY, dtype: int64
****************************************
Consumer credit                                 72.92
Credit card                                     23.43
Car loan                                         1.61
Mortgage                                         1.07
Microloan                                        0.72
Loan for business development                    0.12
Another type of loan                             0.06
Unknown type of loan                             0.03
Loan for working capital replenishment           0.03
Cash loan (non-earmarked)                        0.00
Real estate loan                                 0.00
Loan for the purchase of equipment               0.00
Loan for purchase of shares (margin lending)     0.00
Interbank credit                                 0.00
Mobile operator loan                             0.00
Name: CREDIT_TYPE, dtype: float64
Consumer credit                                 1251615
Credit card                                      402195
Car loan                                          27690
Mortgage                                          18391
Microloan                                         12413
Loan for business development                      1975
Another type of loan                               1017
Unknown type of loan                                555
Loan for working capital replenishment              469
Cash loan (non-earmarked)                            56
Real estate loan                                     27
Loan for the purchase of equipment                   19
Loan for purchase of shares (margin lending)          4
Interbank credit                                      1
Mobile operator loan                                  1
Name: CREDIT_TYPE, dtype: int64
****************************************
In [18]:
columns = bureau.columns[bureau.dtypes == 'float']
In [19]:
columns.tolist()
Out[19]:
['DAYS_CREDIT_ENDDATE',
 'DAYS_ENDDATE_FACT',
 'AMT_CREDIT_MAX_OVERDUE',
 'AMT_CREDIT_SUM',
 'AMT_CREDIT_SUM_DEBT',
 'AMT_CREDIT_SUM_LIMIT',
 'AMT_CREDIT_SUM_OVERDUE',
 'AMT_ANNUITY']
In [20]:
corr_matrix = bureau.corr()
In [21]:
corr_matrix
Out[21]:
SK_ID_CURR SK_ID_BUREAU DAYS_CREDIT CREDIT_DAY_OVERDUE DAYS_CREDIT_ENDDATE DAYS_ENDDATE_FACT AMT_CREDIT_MAX_OVERDUE CNT_CREDIT_PROLONG AMT_CREDIT_SUM AMT_CREDIT_SUM_DEBT AMT_CREDIT_SUM_LIMIT AMT_CREDIT_SUM_OVERDUE DAYS_CREDIT_UPDATE AMT_ANNUITY
SK_ID_CURR 1.000000 0.000135 0.000266 0.000283 0.000456 -0.000648 0.001329 -0.000388 0.001179 -0.000790 -0.000304 -0.000014 0.000510 -0.002727
SK_ID_BUREAU 0.000135 1.000000 0.013015 -0.002628 0.009107 0.017890 0.002290 -0.000740 0.007962 0.005732 -0.003986 -0.000499 0.019398 0.001799
DAYS_CREDIT 0.000266 0.013015 1.000000 -0.027266 0.225682 0.875359 -0.014724 -0.030460 0.050883 0.135397 0.025140 -0.000383 0.688771 0.005676
CREDIT_DAY_OVERDUE 0.000283 -0.002628 -0.027266 1.000000 -0.007352 -0.008637 0.001249 0.002756 -0.003292 -0.002355 -0.000345 0.090951 -0.018461 -0.000339
DAYS_CREDIT_ENDDATE 0.000456 0.009107 0.225682 -0.007352 1.000000 0.248825 0.000577 0.113683 0.055424 0.081298 0.095421 0.001077 0.248525 0.000475
DAYS_ENDDATE_FACT -0.000648 0.017890 0.875359 -0.008637 0.248825 1.000000 0.000999 0.012017 0.059096 0.019609 0.019476 -0.000332 0.751294 0.006274
AMT_CREDIT_MAX_OVERDUE 0.001329 0.002290 -0.014724 0.001249 0.000577 0.000999 1.000000 0.001523 0.081663 0.014007 -0.000112 0.015036 -0.000749 0.001578
CNT_CREDIT_PROLONG -0.000388 -0.000740 -0.030460 0.002756 0.113683 0.012017 0.001523 1.000000 -0.008345 -0.001366 0.073805 0.000002 0.017864 -0.000465
AMT_CREDIT_SUM 0.001179 0.007962 0.050883 -0.003292 0.055424 0.059096 0.081663 -0.008345 1.000000 0.683419 0.003756 0.006342 0.104629 0.049146
AMT_CREDIT_SUM_DEBT -0.000790 0.005732 0.135397 -0.002355 0.081298 0.019609 0.014007 -0.001366 0.683419 1.000000 -0.018215 0.008046 0.141235 0.025507
AMT_CREDIT_SUM_LIMIT -0.000304 -0.003986 0.025140 -0.000345 0.095421 0.019476 -0.000112 0.073805 0.003756 -0.018215 1.000000 -0.000687 0.046028 0.004392
AMT_CREDIT_SUM_OVERDUE -0.000014 -0.000499 -0.000383 0.090951 0.001077 -0.000332 0.015036 0.000002 0.006342 0.008046 -0.000687 1.000000 0.003528 0.000344
DAYS_CREDIT_UPDATE 0.000510 0.019398 0.688771 -0.018461 0.248525 0.751294 -0.000749 0.017864 0.104629 0.141235 0.046028 0.003528 1.000000 0.008418
AMT_ANNUITY -0.002727 0.001799 0.005676 -0.000339 0.000475 0.006274 0.001578 -0.000465 0.049146 0.025507 0.004392 0.000344 0.008418 1.000000
In [22]:
plt.subplots(figsize = (15,15))
sns.heatmap(bureau.corr(), cmap = 'viridis')
Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fd1506978d0>
In [23]:
df_bureau_corr = bureau[['DAYS_CREDIT_ENDDATE',
 'DAYS_ENDDATE_FACT',
 'AMT_CREDIT_MAX_OVERDUE',
 'AMT_CREDIT_SUM',
 'AMT_CREDIT_SUM_DEBT',
 'AMT_CREDIT_SUM_LIMIT',
 'AMT_CREDIT_SUM_OVERDUE',
 'AMT_ANNUITY'
]].copy()

# Calculate correlations
corr = df_bureau_corr.corr().abs()
# Heatmap
plt.figure(figsize=(15,8))
sns.heatmap(corr, annot=True, linewidths=.2, cmap="icefire");
In [ ]:
 
In [ ]:
 

Bureau data feature engineering

Bureau data is evaluated with respective of following possible feature categories

  • Active loan vs closed loans of applicants
  • Loan count by various loan types
  • Credit duration
  • Duration by loan, debit
  • Credit limit and average balance
In [24]:
bureau.head()
Out[24]:
SK_ID_CURR SK_ID_BUREAU CREDIT_ACTIVE CREDIT_CURRENCY DAYS_CREDIT CREDIT_DAY_OVERDUE DAYS_CREDIT_ENDDATE DAYS_ENDDATE_FACT AMT_CREDIT_MAX_OVERDUE CNT_CREDIT_PROLONG AMT_CREDIT_SUM AMT_CREDIT_SUM_DEBT AMT_CREDIT_SUM_LIMIT AMT_CREDIT_SUM_OVERDUE CREDIT_TYPE DAYS_CREDIT_UPDATE AMT_ANNUITY
0 215354 5714462 Closed currency 1 -497 0 -153.0 -153.0 NaN 0 91323.0 0.0 NaN 0.0 Consumer credit -131 NaN
1 215354 5714463 Active currency 1 -208 0 1075.0 NaN NaN 0 225000.0 171342.0 NaN 0.0 Credit card -20 NaN
2 215354 5714464 Active currency 1 -203 0 528.0 NaN NaN 0 464323.5 NaN NaN 0.0 Consumer credit -16 NaN
3 215354 5714465 Active currency 1 -203 0 NaN NaN NaN 0 90000.0 NaN NaN 0.0 Credit card -16 NaN
4 215354 5714466 Active currency 1 -629 0 1197.0 NaN 77674.5 0 2700000.0 NaN NaN 0.0 Consumer credit -21 NaN
In [25]:
bureau['CREDIT_DURATION'] = -bureau['DAYS_CREDIT'] + bureau['DAYS_CREDIT_ENDDATE']
bureau['ENDDATE_DIF'] = bureau['DAYS_CREDIT_ENDDATE'] - bureau['DAYS_ENDDATE_FACT']
bureau['DEBT_PERCENTAGE'] = bureau['AMT_CREDIT_SUM_DEBT'] / bureau['AMT_CREDIT_SUM']

bureau.head()
Out[25]:
SK_ID_CURR SK_ID_BUREAU CREDIT_ACTIVE CREDIT_CURRENCY DAYS_CREDIT CREDIT_DAY_OVERDUE DAYS_CREDIT_ENDDATE DAYS_ENDDATE_FACT AMT_CREDIT_MAX_OVERDUE CNT_CREDIT_PROLONG AMT_CREDIT_SUM AMT_CREDIT_SUM_DEBT AMT_CREDIT_SUM_LIMIT AMT_CREDIT_SUM_OVERDUE CREDIT_TYPE DAYS_CREDIT_UPDATE AMT_ANNUITY CREDIT_DURATION ENDDATE_DIF DEBT_PERCENTAGE
0 215354 5714462 Closed currency 1 -497 0 -153.0 -153.0 NaN 0 91323.0 0.0 NaN 0.0 Consumer credit -131 NaN 344.0 0.0 0.00000
1 215354 5714463 Active currency 1 -208 0 1075.0 NaN NaN 0 225000.0 171342.0 NaN 0.0 Credit card -20 NaN 1283.0 NaN 0.76152
2 215354 5714464 Active currency 1 -203 0 528.0 NaN NaN 0 464323.5 NaN NaN 0.0 Consumer credit -16 NaN 731.0 NaN NaN
3 215354 5714465 Active currency 1 -203 0 NaN NaN NaN 0 90000.0 NaN NaN 0.0 Credit card -16 NaN NaN NaN NaN
4 215354 5714466 Active currency 1 -629 0 1197.0 NaN 77674.5 0 2700000.0 NaN NaN 0.0 Consumer credit -21 NaN 1826.0 NaN NaN
In [168]:
# columns = bureau.columns[bureau.dtypes == 'object']

# rows_to_use  =1
# plots_in_each_columns  = int(len(columns)/rows_to_use)
# fig, axs = plt.subplots(rows_to_use, plots_in_each_columns, figsize=(20,8))
# for i,c in enumerate(columns):
#     sns.countplot(bureau[c], ax = axs[i])
sns.heatmap(bureau.isna(), cmap = 'viridis')
Out[168]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb0fa184470>

From above code we can see that Most credit is either Active or Closed. Also, most clinet uses currency1 for loan type. Majority purpose for taking loan is either consumer credit or Credit Card repayment

In [24]:
app_train_merged = app_train.merge(bureau.groupby('SK_ID_CURR').mean().reset_index(), 
                                            left_on='SK_ID_CURR', right_on='SK_ID_CURR', 
                                            how='left', validate='one_to_one')
In [48]:
sns.distplot(app_train_merged[app_train_merged['TARGET'] == 0]['DAYS_CREDIT'],  label='TARGET 0', bins=50, color='b')
sns.distplot(app_train_merged[app_train_merged['TARGET'] == 1]['DAYS_CREDIT'],  label='TARGET 1', bins=50, color='r')
plt.legend()
plt.show()
/usr/local/lib/python3.6/dist-packages/seaborn/distributions.py:2551: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)
/usr/local/lib/python3.6/dist-packages/seaborn/distributions.py:2551: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)
In [29]:
def overdue(x):
    if x < 30:
        return 'A'
    elif x < 60:
        return 'B'
    elif x < 90:
        return 'C'
    elif x < 180:
        return 'D'
    elif x < 365:
        return 'E'
    else:
        return 'F'
In [47]:
fig, ax = plt.subplots(1, 1, figsize=(16, 6))
sns.kdeplot(app_train_merged[(app_train_merged['TARGET'] == 0) & (app_train_merged['CREDIT_DAY_OVERDUE'] > 30)]['CREDIT_DAY_OVERDUE'].dropna(),  label='TARGET 0',  color='b')
sns.kdeplot(app_train_merged[(app_train_merged['TARGET'] == 1) & (app_train_merged['CREDIT_DAY_OVERDUE'] > 30)]['CREDIT_DAY_OVERDUE'].dropna(),  label='TARGET 1',  color='r')
plt.legend()
plt.show()
In [ ]:
 

Generating Bureau Merged Data

Function to generate file Bureau features to be merged with application train data

In [26]:
def bureau_data_generator(bureau,bureau_balance):
    #bureau, bureau_balance = read_files(['bureau.csv', 'bureau_balance.csv'], path, print_details=False)

    bureau_balance['STATUS'] = bureau_balance['STATUS'].map({'C': -1, 'X' : 0})
    bureau_balance['STATUS'] = pd.to_numeric(bureau_balance['STATUS'],
                                                            errors = 'coerce', )
    
    bb_aggregations = {'MONTHS_BALANCE': ['min', 'max', 'size'],
                       'STATUS': ['min', 'max', 'mean']
                       }
    
    bb_agg = bureau_balance.groupby('SK_ID_BUREAU').agg(bb_aggregations)
    bb_agg.columns = [e[0] + '_' + e[1].upper() for e in bb_agg.columns.to_list()]

    bb_agg.reset_index(inplace=True)

    bureau = bureau.merge(bb_agg, on = 'SK_ID_BUREAU', how = 'left')


    ### Feature Engineering for Bureau Data
    rare_categories = ['Microloan', 'Loan for business development',
                       'Another type of loan','Unknown type of loan',
                       'Loan for working capital replenishment',
                       'Cash loan (non-earmarked)', 'Real estate loan',
                       'Loan for the purchase of equipment',
                       'Loan for purchase of shares (margin lending)',
                       'Interbank credit', 'Mobile operator loan']


    bureau['CREDIT_ACTIVE_Binary'] = bureau['CREDIT_ACTIVE'].apply(lambda x : 0 if x == 'Closed' else 1)
    bureau['CREDIT_ENDDATE_Binary'] = bureau['DAYS_CREDIT_ENDDATE'].apply(lambda x : 1 if x < 0 else 0)
    bureau['CREDIT_DURATION'] = -bureau['DAYS_CREDIT'] + bureau['DAYS_CREDIT_ENDDATE']
    bureau['ENDDATE_DIF'] = bureau['DAYS_CREDIT_ENDDATE'] - bureau['DAYS_ENDDATE_FACT']
    bureau['DEBT_PERCENTAGE'] = bureau['AMT_CREDIT_SUM_DEBT'] / bureau['AMT_CREDIT_SUM']
    bureau['CREDIT_TYPE_CATEGORY'] = bureau['CREDIT_TYPE'].apply(lambda x : 'RARE' if x in rare_categories else x)

    bureau, bureau_cat = one_hot_encoder(bureau, nan_as_category=True, columns = ['CREDIT_TYPE_CATEGORY'])
    cat_aggregations = {}
    for cat in bureau_cat: cat_aggregations[cat] = ['mean', 'sum']

    num_aggregations = {
    'SK_ID_BUREAU' : ['count'],
    'CREDIT_TYPE'  : ['nunique'],
    'CREDIT_ACTIVE_Binary' : ['mean'],
    'CREDIT_ENDDATE_Binary' : ['mean'],
    'DAYS_CREDIT': [ 'mean', 'var'],
    'DAYS_CREDIT_UPDATE': ['mean'],
    'CREDIT_DAY_OVERDUE': ['mean'],
    'AMT_CREDIT_MAX_OVERDUE': ['mean'],
    'AMT_CREDIT_SUM': [ 'mean', 'sum'],
    'AMT_CREDIT_SUM_DEBT': [ 'mean', 'sum'],
    'AMT_CREDIT_SUM_OVERDUE': ['mean'],
    'AMT_CREDIT_SUM_LIMIT': ['mean', 'sum'],
    'AMT_ANNUITY': ['max', 'mean'],
    'CNT_CREDIT_PROLONG': ['sum'],
    'MONTHS_BALANCE_MIN': ['min'],
    'MONTHS_BALANCE_MAX': ['max'],
    'MONTHS_BALANCE_SIZE': ['mean', 'sum'],
    'STATUS_MIN': ['min'],
    'STATUS_MAX': ['max'],
    'STATUS_MEAN': ['mean'],
    'DEBT_PERCENTAGE' : ['mean'],
    'CREDIT_DURATION' : ['mean', 'sum'],
	'ENDDATE_DIF'   : ['mean', 'sum']

    }

    bureau_agg = bureau.groupby('SK_ID_CURR').agg({**num_aggregations, **cat_aggregations})

    bureau_agg.columns = [e[0] + '_' + e[1].upper() for e in bureau_agg.columns.to_list()]

    bureau_agg.columns =  ['LOAN_COUNT', 'LOAN_TYPES', 'ACTIVE_LOANS_PERCENTAGE', 'CREDIT_ENDDATE_PERCENTAGE',
                           'AVG_DAYS_FROM_LAST_APPLICATION', 'VARIANCE_DAYS_FROM_LAST_APPLICATION', 'DAYS_CREDIT_UPDATE_MEAN',
                           'AVG_OVERDUE_DAYS', 'AVG_CREDIT_OVERDUE', 'AVG_CREDIT_AMT', 'TOTAL_CREDIT_AMT', 'AVG_CREDIT_DEBT',
                           'TOTAL_CREDIT_DEBT', 'AVG_CURRENT_AMT_DUE', 'AVG_CREDIT_CARD_LIMIT', 'TOTAL_CREDIT_CARD_LIMIT',
                           'MAX_ANNUITY', 'AVG_ANNUITY', 'TOTAL_PROLONGED_CREDIT', 'MONTHS_BALANCE_MIN', 'MONTHS_BALANCE_MAX',
                           'MONTHS_BALANCE_SIZE_MEAN', 'MONTHS_BALANCE_SIZE_SUM', 'STATUS_MIN', 'STATUS_MAX', 'STATUS_MEAN',
                           'DEBT_PERCENTAGE_MEAN', 'CREDIT_DURATION_MEAN', 'CREDIT_DURATION_SUM', 'ENDDATE_DIF_MEAN',
                           'ENDDATE_DIF_SUM',
                           'AVG_COUNT_CAR_LOAN','TOT_COUNT_CAR_LOAN',
                           'AVG_COUNT_CONSUMER_LOAN', 'TOT_COUNT_CONSUMER_LOAN',
                           'AVG_COUNT_CREDIT_CARD_LOAN', 'TOT_COUNT_CREDIT_CARD_LOAN',
                           'AVG_COUNT_MORTGAGE', 'TOT_COUNT_MORTGAGE',
                           'AVG_COUNT_RARE_LOAN', 'TOT_COUNT_RARE_LOAN',
                           'AVG_COUNT_NAN_LOAN', 'TOT_COUNT_NAN_LOAN']

    bureau_agg.columns = ['BUREAU_' + e for e in bureau_agg.columns.to_list()]
    bureau_agg.reset_index(inplace=True)
    return bureau_agg
In [27]:
df = bureau_data_generator(bureau,bureau_balance)
In [28]:
df.head()
Out[28]:
SK_ID_CURR BUREAU_LOAN_COUNT BUREAU_LOAN_TYPES BUREAU_ACTIVE_LOANS_PERCENTAGE BUREAU_CREDIT_ENDDATE_PERCENTAGE BUREAU_AVG_DAYS_FROM_LAST_APPLICATION BUREAU_VARIANCE_DAYS_FROM_LAST_APPLICATION BUREAU_DAYS_CREDIT_UPDATE_MEAN BUREAU_AVG_OVERDUE_DAYS BUREAU_AVG_CREDIT_OVERDUE BUREAU_AVG_CREDIT_AMT BUREAU_TOTAL_CREDIT_AMT BUREAU_AVG_CREDIT_DEBT BUREAU_TOTAL_CREDIT_DEBT BUREAU_AVG_CURRENT_AMT_DUE BUREAU_AVG_CREDIT_CARD_LIMIT BUREAU_TOTAL_CREDIT_CARD_LIMIT BUREAU_MAX_ANNUITY BUREAU_AVG_ANNUITY BUREAU_TOTAL_PROLONGED_CREDIT BUREAU_MONTHS_BALANCE_MIN BUREAU_MONTHS_BALANCE_MAX BUREAU_MONTHS_BALANCE_SIZE_MEAN BUREAU_MONTHS_BALANCE_SIZE_SUM BUREAU_STATUS_MIN BUREAU_STATUS_MAX BUREAU_STATUS_MEAN BUREAU_DEBT_PERCENTAGE_MEAN BUREAU_CREDIT_DURATION_MEAN BUREAU_CREDIT_DURATION_SUM BUREAU_ENDDATE_DIF_MEAN BUREAU_ENDDATE_DIF_SUM BUREAU_AVG_COUNT_CAR_LOAN BUREAU_TOT_COUNT_CAR_LOAN BUREAU_AVG_COUNT_CONSUMER_LOAN BUREAU_TOT_COUNT_CONSUMER_LOAN BUREAU_AVG_COUNT_CREDIT_CARD_LOAN BUREAU_TOT_COUNT_CREDIT_CARD_LOAN BUREAU_AVG_COUNT_MORTGAGE BUREAU_TOT_COUNT_MORTGAGE BUREAU_AVG_COUNT_RARE_LOAN BUREAU_TOT_COUNT_RARE_LOAN BUREAU_AVG_COUNT_NAN_LOAN BUREAU_TOT_COUNT_NAN_LOAN
0 100001 7 1 0.428571 0.571429 -735.000000 240043.666667 -93.142857 0.0 NaN 207623.571429 1453365.000 85240.928571 596686.5 0.0 0.00000 0.000 10822.5 3545.357143 0 -51.0 0.0 24.571429 172.0 -1.0 0.0 -0.543363 0.282518 817.428571 5722.0 197.0 788.0 0.0 0 1.000000 7 0.000000 0 0.0 0 0.0 0 0 0
1 100002 8 2 0.250000 0.375000 -874.000000 186150.000000 -499.875000 0.0 1681.029 108131.945625 865055.565 49156.200000 245781.0 0.0 7997.14125 31988.565 0.0 0.000000 0 -47.0 0.0 13.750000 110.0 -1.0 0.0 -0.466667 0.136545 719.833333 4319.0 252.6 1263.0 0.0 0 0.500000 4 0.500000 4 0.0 0 0.0 0 0 0
2 100003 4 2 0.250000 0.750000 -1400.750000 827783.583333 -816.000000 0.0 0.000 254350.125000 1017400.500 0.000000 0.0 0.0 202500.00000 810000.000 NaN NaN 0 NaN NaN NaN 0.0 NaN NaN NaN 0.000000 856.250000 3425.0 -34.0 -102.0 0.0 0 0.500000 2 0.500000 2 0.0 0 0.0 0 0 0
3 100004 2 1 0.000000 1.000000 -867.000000 421362.000000 -532.000000 0.0 0.000 94518.900000 189037.800 0.000000 0.0 0.0 0.00000 0.000 NaN NaN 0 NaN NaN NaN 0.0 NaN NaN NaN 0.000000 378.500000 757.0 44.0 88.0 0.0 0 1.000000 2 0.000000 0 0.0 0 0.0 0 0 0
4 100005 3 2 0.666667 0.333333 -190.666667 26340.333333 -54.333333 0.0 0.000 219042.000000 657126.000 189469.500000 568408.5 0.0 0.00000 0.000 4261.5 1420.500000 0 -12.0 0.0 7.000000 21.0 -1.0 0.0 -0.416667 0.601256 630.000000 1890.0 -5.0 -5.0 0.0 0 0.666667 2 0.333333 1 0.0 0 0.0 0 0 0
In [29]:
#df.to_csv("/content/drive/My Drive/I526 Final Project/data/processed/bureau_processed.csv", index = False)
df.to_csv("../Project/data/processed/bureau_processed.csv", index = False)
In [30]:
app_train_i=app_train[['SK_ID_CURR', 'TARGET']]
In [31]:
app_train_bureau_merged = app_train_i.merge(df.reset_index(), 
                                            left_on='SK_ID_CURR', right_on='SK_ID_CURR', 
                                            how='left', validate='one_to_one')
In [32]:
corr_matrix = app_train_bureau_merged.corr()
In [33]:
corr_matrix["TARGET"].sort_values(ascending=False)
Out[33]:
TARGET                                        1.000000
BUREAU_AVG_DAYS_FROM_LAST_APPLICATION         0.089729
BUREAU_ACTIVE_LOANS_PERCENTAGE                0.079369
BUREAU_MONTHS_BALANCE_MIN                     0.073225
BUREAU_DAYS_CREDIT_UPDATE_MEAN                0.068927
BUREAU_STATUS_MEAN                            0.042242
BUREAU_TOT_COUNT_CREDIT_CARD_LOAN             0.034818
BUREAU_AVG_COUNT_RARE_LOAN                    0.034763
BUREAU_AVG_COUNT_CREDIT_CARD_LOAN             0.034684
BUREAU_CREDIT_DURATION_MEAN                   0.033050
BUREAU_STATUS_MIN                             0.032888
BUREAU_TOT_COUNT_RARE_LOAN                    0.032135
BUREAU_CREDIT_DURATION_SUM                    0.030525
BUREAU_STATUS_MAX                             0.011649
BUREAU_AVG_OVERDUE_DAYS                       0.008118
BUREAU_AVG_CURRENT_AMT_DUE                    0.007150
BUREAU_TOTAL_CREDIT_DEBT                      0.007144
BUREAU_MONTHS_BALANCE_MAX                     0.005119
BUREAU_LOAN_TYPES                             0.004624
BUREAU_TOTAL_PROLONGED_CREDIT                 0.004058
BUREAU_LOAN_COUNT                             0.004056
BUREAU_AVG_CREDIT_OVERDUE                     0.002435
BUREAU_DEBT_PERCENTAGE_MEAN                   0.001520
BUREAU_MAX_ANNUITY                            0.001120
BUREAU_AVG_CREDIT_DEBT                       -0.000637
BUREAU_AVG_ANNUITY                           -0.001391
SK_ID_CURR                                   -0.002108
index                                        -0.002677
BUREAU_ENDDATE_DIF_MEAN                      -0.007734
BUREAU_TOTAL_CREDIT_CARD_LIMIT               -0.009419
BUREAU_TOT_COUNT_CONSUMER_LOAN               -0.010707
BUREAU_AVG_CREDIT_CARD_LIMIT                 -0.011446
BUREAU_MONTHS_BALANCE_SIZE_SUM               -0.013638
BUREAU_TOTAL_CREDIT_AMT                      -0.014057
BUREAU_ENDDATE_DIF_SUM                       -0.014294
BUREAU_AVG_CREDIT_AMT                        -0.019957
BUREAU_AVG_COUNT_CAR_LOAN                    -0.020134
BUREAU_TOT_COUNT_CAR_LOAN                    -0.020817
BUREAU_AVG_COUNT_MORTGAGE                    -0.020867
BUREAU_TOT_COUNT_MORTGAGE                    -0.023307
BUREAU_AVG_COUNT_CONSUMER_LOAN               -0.026258
BUREAU_VARIANCE_DAYS_FROM_LAST_APPLICATION   -0.038440
BUREAU_CREDIT_ENDDATE_PERCENTAGE             -0.073288
BUREAU_MONTHS_BALANCE_SIZE_MEAN              -0.080193
BUREAU_AVG_COUNT_NAN_LOAN                          NaN
BUREAU_TOT_COUNT_NAN_LOAN                          NaN
Name: TARGET, dtype: float64
In [34]:
app_train_bureau_merged.head()
Out[34]:
SK_ID_CURR TARGET index BUREAU_LOAN_COUNT BUREAU_LOAN_TYPES BUREAU_ACTIVE_LOANS_PERCENTAGE BUREAU_CREDIT_ENDDATE_PERCENTAGE BUREAU_AVG_DAYS_FROM_LAST_APPLICATION BUREAU_VARIANCE_DAYS_FROM_LAST_APPLICATION BUREAU_DAYS_CREDIT_UPDATE_MEAN BUREAU_AVG_OVERDUE_DAYS BUREAU_AVG_CREDIT_OVERDUE BUREAU_AVG_CREDIT_AMT BUREAU_TOTAL_CREDIT_AMT BUREAU_AVG_CREDIT_DEBT BUREAU_TOTAL_CREDIT_DEBT BUREAU_AVG_CURRENT_AMT_DUE BUREAU_AVG_CREDIT_CARD_LIMIT BUREAU_TOTAL_CREDIT_CARD_LIMIT BUREAU_MAX_ANNUITY BUREAU_AVG_ANNUITY BUREAU_TOTAL_PROLONGED_CREDIT BUREAU_MONTHS_BALANCE_MIN BUREAU_MONTHS_BALANCE_MAX BUREAU_MONTHS_BALANCE_SIZE_MEAN BUREAU_MONTHS_BALANCE_SIZE_SUM BUREAU_STATUS_MIN BUREAU_STATUS_MAX BUREAU_STATUS_MEAN BUREAU_DEBT_PERCENTAGE_MEAN BUREAU_CREDIT_DURATION_MEAN BUREAU_CREDIT_DURATION_SUM BUREAU_ENDDATE_DIF_MEAN BUREAU_ENDDATE_DIF_SUM BUREAU_AVG_COUNT_CAR_LOAN BUREAU_TOT_COUNT_CAR_LOAN BUREAU_AVG_COUNT_CONSUMER_LOAN BUREAU_TOT_COUNT_CONSUMER_LOAN BUREAU_AVG_COUNT_CREDIT_CARD_LOAN BUREAU_TOT_COUNT_CREDIT_CARD_LOAN BUREAU_AVG_COUNT_MORTGAGE BUREAU_TOT_COUNT_MORTGAGE BUREAU_AVG_COUNT_RARE_LOAN BUREAU_TOT_COUNT_RARE_LOAN BUREAU_AVG_COUNT_NAN_LOAN BUREAU_TOT_COUNT_NAN_LOAN
0 100002 1 1.0 8.0 2.0 0.25 0.375 -874.00 186150.000000 -499.875 0.0 1681.029 108131.945625 865055.565 49156.2 245781.0 0.0 7997.14125 31988.565 0.0 0.0 0.0 -47.0 0.0 13.75 110.0 -1.0 0.0 -0.466667 0.136545 719.833333 4319.0 252.6 1263.0 0.0 0.0 0.5 4.0 0.5 4.0 0.0 0.0 0.0 0.0 0.0 0.0
1 100003 0 2.0 4.0 2.0 0.25 0.750 -1400.75 827783.583333 -816.000 0.0 0.000 254350.125000 1017400.500 0.0 0.0 0.0 202500.00000 810000.000 NaN NaN 0.0 NaN NaN NaN 0.0 NaN NaN NaN 0.000000 856.250000 3425.0 -34.0 -102.0 0.0 0.0 0.5 2.0 0.5 2.0 0.0 0.0 0.0 0.0 0.0 0.0
2 100004 0 3.0 2.0 1.0 0.00 1.000 -867.00 421362.000000 -532.000 0.0 0.000 94518.900000 189037.800 0.0 0.0 0.0 0.00000 0.000 NaN NaN 0.0 NaN NaN NaN 0.0 NaN NaN NaN 0.000000 378.500000 757.0 44.0 88.0 0.0 0.0 1.0 2.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 100006 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 100007 0 5.0 1.0 1.0 0.00 1.000 -1149.00 NaN -783.000 0.0 0.000 146250.000000 146250.000 0.0 0.0 0.0 0.00000 0.000 NaN NaN 0.0 NaN NaN NaN 0.0 NaN NaN NaN 0.000000 366.000000 366.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
In [36]:
app_train_numeric = app_train_bureau_merged[ app_train_bureau_merged.dtypes[app_train_bureau_merged.dtypes == 'float64'].index]
numeric_index = app_train_numeric.isna().sum()[app_train_numeric.isna().sum()/len(app_train_numeric) <0.5].index[:5]
app_train_numeric =  app_train_numeric[numeric_index]

app_train_numeric['TARGET'] = app_train_bureau_merged['TARGET']
g = app_train_numeric.groupby('TARGET')
app_train_numeric = g.apply(lambda x: x.sample(g.size().min()).reset_index(drop=True))
sns.pairplot(app_train_numeric, hue= 'TARGET')
Out[36]:
<seaborn.axisgrid.PairGrid at 0x7fd136fd6550>
In [42]:
app_train_ic = app_train_bureau_merged[['TARGET', 'BUREAU_AVG_DAYS_FROM_LAST_APPLICATION', 'BUREAU_MONTHS_BALANCE_MIN']]


fig,axs = plt.subplots(2,2,figsize = (15,15))

# plt.figure(figsize=(12,5))
# plt.title("Distribution of AMT_CREDIT")
sns.distplot((app_train_ic["BUREAU_AVG_DAYS_FROM_LAST_APPLICATION"]), ax = axs[0,0])
axs[0,0].set_title('Avg days diff from last application')
axs[0,0].set_xlabel(' Days diff from last app')
sns.boxplot(y =(app_train_ic["BUREAU_AVG_DAYS_FROM_LAST_APPLICATION"]), x = app_train_ic['TARGET'], ax = axs[0,1])
axs[0,1].set_title(' Distirbution of Days diff from last app - Box Plot')

sns.distplot((app_train_ic["BUREAU_MONTHS_BALANCE_MIN"]), ax = axs[1,0])
axs[1,0].set_title(' Distirbution of Month Min Balance')
axs[1,0].set_xlabel('Logarithmic Month Min Balance')
sns.boxplot(y = (app_train_ic["BUREAU_MONTHS_BALANCE_MIN"]), x = app_train_ic['TARGET'], ax = axs[1,1])
axs[1,1].set_title(' Distirbution of Month Min Balance - Box Plot')
Out[42]:
Text(0.5, 1.0, ' Distirbution of Month Min Balance - Box Plot')
In [ ]:
 

Previous application data

Previous application data description

Pervious application files have one record for each previous application for give applicant in application_train data. There is one-to-many relationship between application_train and previous_application based on SK_ID_CUU as primary key in application train and foreign key in previous application dataset. SK_ID_PREV is primary key of previous_application dataset.

Previos applicaiton dataset has information about previous loan parameters and applicant details at time of previous loan. Some of features in previous application dataset are

  • NAME_CONTRACT_TYPE -Contract product type (Cash loan, consumer loan [POS] ,...) of the previous application
  • AMT_ANNUITY -Annuity of previous application
  • AMT_APPLICATION -For how much credit did client ask on the previous application
  • AMT_CREDIT - Final credit amount on the previous application.
  • NAME_CONTRACT_STATUS- Contract status (approved, cancelled, ...) of previous application
  • NAME_CLIENT_TYPE- Was the client old or new client when applying for the previous application
  • NAME_CASH_LOAN_PURPOSE- Purpose of the cash loan

Previous application data EDA

In [116]:
file_list = ['application_train.csv','previous_application.csv']


#app_test, app_train, installment, bureau_balance, pos_cash, bureau, prev_app, cc_bal = read_files(file_list, path, print_details=False)
app_train,prev_app= read_files(file_list, path, print_details=False)
1it [00:04,  4.17s/it]
0it [00:00, ?it/s]
application_train.csv ... Read Completed
****************************************
4it [00:09,  2.29s/it]
previous_application.csv ... Read Completed
****************************************

In [70]:
prev_app.head(10)
Out[70]:
SK_ID_PREV SK_ID_CURR NAME_CONTRACT_TYPE AMT_ANNUITY AMT_APPLICATION AMT_CREDIT AMT_DOWN_PAYMENT AMT_GOODS_PRICE WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START FLAG_LAST_APPL_PER_CONTRACT NFLAG_LAST_APPL_IN_DAY RATE_DOWN_PAYMENT RATE_INTEREST_PRIMARY RATE_INTEREST_PRIVILEGED NAME_CASH_LOAN_PURPOSE NAME_CONTRACT_STATUS DAYS_DECISION NAME_PAYMENT_TYPE CODE_REJECT_REASON NAME_TYPE_SUITE NAME_CLIENT_TYPE NAME_GOODS_CATEGORY NAME_PORTFOLIO NAME_PRODUCT_TYPE CHANNEL_TYPE SELLERPLACE_AREA NAME_SELLER_INDUSTRY CNT_PAYMENT NAME_YIELD_GROUP PRODUCT_COMBINATION DAYS_FIRST_DRAWING DAYS_FIRST_DUE DAYS_LAST_DUE_1ST_VERSION DAYS_LAST_DUE DAYS_TERMINATION NFLAG_INSURED_ON_APPROVAL
0 2030495 271877 Consumer loans 1730.430 17145.0 17145.0 0.0 17145.0 SATURDAY 15 Y 1 0.0 0.182832 0.867336 XAP Approved -73 Cash through the bank XAP NaN Repeater Mobile POS XNA Country-wide 35 Connectivity 12.0 middle POS mobile with interest 365243.0 -42.0 300.0 -42.0 -37.0 0.0
1 2802425 108129 Cash loans 25188.615 607500.0 679671.0 NaN 607500.0 THURSDAY 11 Y 1 NaN NaN NaN XNA Approved -164 XNA XAP Unaccompanied Repeater XNA Cash x-sell Contact center -1 XNA 36.0 low_action Cash X-Sell: low 365243.0 -134.0 916.0 365243.0 365243.0 1.0
2 2523466 122040 Cash loans 15060.735 112500.0 136444.5 NaN 112500.0 TUESDAY 11 Y 1 NaN NaN NaN XNA Approved -301 Cash through the bank XAP Spouse, partner Repeater XNA Cash x-sell Credit and cash offices -1 XNA 12.0 high Cash X-Sell: high 365243.0 -271.0 59.0 365243.0 365243.0 1.0
3 2819243 176158 Cash loans 47041.335 450000.0 470790.0 NaN 450000.0 MONDAY 7 Y 1 NaN NaN NaN XNA Approved -512 Cash through the bank XAP NaN Repeater XNA Cash x-sell Credit and cash offices -1 XNA 12.0 middle Cash X-Sell: middle 365243.0 -482.0 -152.0 -182.0 -177.0 1.0
4 1784265 202054 Cash loans 31924.395 337500.0 404055.0 NaN 337500.0 THURSDAY 9 Y 1 NaN NaN NaN Repairs Refused -781 Cash through the bank HC NaN Repeater XNA Cash walk-in Credit and cash offices -1 XNA 24.0 high Cash Street: high NaN NaN NaN NaN NaN NaN
5 1383531 199383 Cash loans 23703.930 315000.0 340573.5 NaN 315000.0 SATURDAY 8 Y 1 NaN NaN NaN Everyday expenses Approved -684 Cash through the bank XAP Family Repeater XNA Cash x-sell Credit and cash offices -1 XNA 18.0 low_normal Cash X-Sell: low 365243.0 -654.0 -144.0 -144.0 -137.0 1.0
6 2315218 175704 Cash loans NaN 0.0 0.0 NaN NaN TUESDAY 11 Y 1 NaN NaN NaN XNA Canceled -14 XNA XAP NaN Repeater XNA XNA XNA Credit and cash offices -1 XNA NaN XNA Cash NaN NaN NaN NaN NaN NaN
7 1656711 296299 Cash loans NaN 0.0 0.0 NaN NaN MONDAY 7 Y 1 NaN NaN NaN XNA Canceled -21 XNA XAP NaN Repeater XNA XNA XNA Credit and cash offices -1 XNA NaN XNA Cash NaN NaN NaN NaN NaN NaN
8 2367563 342292 Cash loans NaN 0.0 0.0 NaN NaN MONDAY 15 Y 1 NaN NaN NaN XNA Canceled -386 XNA XAP NaN Repeater XNA XNA XNA Credit and cash offices -1 XNA NaN XNA Cash NaN NaN NaN NaN NaN NaN
9 2579447 334349 Cash loans NaN 0.0 0.0 NaN NaN SATURDAY 15 Y 1 NaN NaN NaN XNA Canceled -57 XNA XAP NaN Repeater XNA XNA XNA Credit and cash offices -1 XNA NaN XNA Cash NaN NaN NaN NaN NaN NaN
In [108]:
prev_app.shape
Out[108]:
(1670214, 37)
In [9]:
prev_app.dtypes.value_counts()
Out[9]:
object     16
float64    15
int64       6
dtype: int64
In [10]:
prev_app.select_dtypes('object').apply(pd.Series.nunique, axis = 0)
Out[10]:
NAME_CONTRACT_TYPE              4
WEEKDAY_APPR_PROCESS_START      7
FLAG_LAST_APPL_PER_CONTRACT     2
NAME_CASH_LOAN_PURPOSE         25
NAME_CONTRACT_STATUS            4
NAME_PAYMENT_TYPE               4
CODE_REJECT_REASON              9
NAME_TYPE_SUITE                 7
NAME_CLIENT_TYPE                4
NAME_GOODS_CATEGORY            28
NAME_PORTFOLIO                  5
NAME_PRODUCT_TYPE               3
CHANNEL_TYPE                    8
NAME_SELLER_INDUSTRY           11
NAME_YIELD_GROUP                5
PRODUCT_COMBINATION            17
dtype: int64
In [11]:
prev_app.select_dtypes('int').apply(pd.Series.nunique, axis = 0)
Out[11]:
SK_ID_PREV                 1670214
SK_ID_CURR                  338857
HOUR_APPR_PROCESS_START         24
NFLAG_LAST_APPL_IN_DAY           2
DAYS_DECISION                 2922
SELLERPLACE_AREA              2097
dtype: int64
In [12]:
prev_app_counts = prev_app.groupby('SK_ID_CURR', as_index=False)['SK_ID_PREV'].count()
In [13]:
prev_app_counts.head(10)
Out[13]:
SK_ID_CURR SK_ID_PREV
0 100001 1
1 100002 1
2 100003 3
3 100004 1
4 100005 2
5 100006 9
6 100007 6
7 100008 5
8 100009 7
9 100010 1
In [14]:
missingdata(prev_app,20)

Number of columns with over 20 percent missing values 
-------------------
14
Out[14]:
Total Percent
RATE_INTEREST_PRIVILEGED 1664263 99.643698
RATE_INTEREST_PRIMARY 1664263 99.643698
RATE_DOWN_PAYMENT 895844 53.636480
AMT_DOWN_PAYMENT 895844 53.636480
NAME_TYPE_SUITE 820405 49.119754
DAYS_TERMINATION 673065 40.298129
NFLAG_INSURED_ON_APPROVAL 673065 40.298129
DAYS_FIRST_DRAWING 673065 40.298129
DAYS_FIRST_DUE 673065 40.298129
DAYS_LAST_DUE_1ST_VERSION 673065 40.298129
DAYS_LAST_DUE 673065 40.298129
AMT_GOODS_PRICE 385515 23.081773
AMT_ANNUITY 372235 22.286665
CNT_PAYMENT 372230 22.286366
In [15]:
cat_list=prev_app.select_dtypes(np.object).columns
cat_tar_list=cat_list.to_list()
#cat_tar_list.append('TARGET')
cat_tar_list
Out[15]:
['NAME_CONTRACT_TYPE',
 'WEEKDAY_APPR_PROCESS_START',
 'FLAG_LAST_APPL_PER_CONTRACT',
 'NAME_CASH_LOAN_PURPOSE',
 'NAME_CONTRACT_STATUS',
 'NAME_PAYMENT_TYPE',
 'CODE_REJECT_REASON',
 'NAME_TYPE_SUITE',
 'NAME_CLIENT_TYPE',
 'NAME_GOODS_CATEGORY',
 'NAME_PORTFOLIO',
 'NAME_PRODUCT_TYPE',
 'CHANNEL_TYPE',
 'NAME_SELLER_INDUSTRY',
 'NAME_YIELD_GROUP',
 'PRODUCT_COMBINATION']
In [16]:
len(cat_tar_list)
Out[16]:
16
In [17]:
fig, ax = plt.subplots(4, 4, figsize=(40, 40))
plt.subplots_adjust(left=None, bottom=None, right=None,
                    top=None, wspace=None, hspace=0.45)
num = 0
for i in range(0, 4):
    for j in range(0, 4):
        tst = sns.countplot(x=cat_tar_list[num],
                           data=prev_app, ax=ax[i][j])
        tst.set_title(f"Distribution by {cat_tar_list[num]}  Variable")
        tst.set_xticklabels(tst.get_xticklabels(), rotation=25)
        num = num + 1
In [74]:
columns = prev_app.columns[prev_app.dtypes == 'float64']
In [75]:
columns.tolist()
Out[75]:
['AMT_ANNUITY',
 'AMT_APPLICATION',
 'AMT_CREDIT',
 'AMT_DOWN_PAYMENT',
 'AMT_GOODS_PRICE',
 'RATE_DOWN_PAYMENT',
 'RATE_INTEREST_PRIMARY',
 'RATE_INTEREST_PRIVILEGED',
 'CNT_PAYMENT',
 'DAYS_FIRST_DRAWING',
 'DAYS_FIRST_DUE',
 'DAYS_LAST_DUE_1ST_VERSION',
 'DAYS_LAST_DUE',
 'DAYS_TERMINATION',
 'NFLAG_INSURED_ON_APPROVAL']
In [76]:
corr_matrix = prev_app.corr()
In [77]:
corr_matrix
Out[77]:
SK_ID_PREV SK_ID_CURR AMT_ANNUITY AMT_APPLICATION AMT_CREDIT AMT_DOWN_PAYMENT AMT_GOODS_PRICE HOUR_APPR_PROCESS_START NFLAG_LAST_APPL_IN_DAY RATE_DOWN_PAYMENT RATE_INTEREST_PRIMARY RATE_INTEREST_PRIVILEGED DAYS_DECISION SELLERPLACE_AREA CNT_PAYMENT DAYS_FIRST_DRAWING DAYS_FIRST_DUE DAYS_LAST_DUE_1ST_VERSION DAYS_LAST_DUE DAYS_TERMINATION NFLAG_INSURED_ON_APPROVAL
SK_ID_PREV 1.000000 -0.000321 0.011459 0.003302 0.003659 -0.001313 0.015293 -0.002652 -0.002828 -0.004051 0.012969 -0.022312 0.019100 -0.001079 0.015589 -0.001478 -0.000071 0.001222 0.001915 0.001781 0.003986
SK_ID_CURR -0.000321 1.000000 0.000577 0.000280 0.000195 -0.000063 0.000369 0.002842 0.000098 0.001158 0.033197 -0.016757 -0.000637 0.001265 0.000031 -0.001329 -0.000757 0.000252 -0.000318 -0.000020 0.000876
AMT_ANNUITY 0.011459 0.000577 1.000000 0.808872 0.816429 0.267694 0.820895 -0.036201 0.020639 -0.103878 0.141823 -0.202335 0.279051 -0.015027 0.394535 0.052839 -0.053295 -0.068877 0.082659 0.068022 0.283080
AMT_APPLICATION 0.003302 0.000280 0.808872 1.000000 0.975824 0.482776 0.999884 -0.014415 0.004310 -0.072479 0.110001 -0.199733 0.133660 -0.007649 0.680630 0.074544 -0.049532 -0.084905 0.172627 0.148618 0.259219
AMT_CREDIT 0.003659 0.000195 0.816429 0.975824 1.000000 0.301284 0.993087 -0.021039 -0.025179 -0.188128 0.125106 -0.205158 0.133763 -0.009567 0.674278 -0.036813 0.002881 0.044031 0.224829 0.214320 0.263932
AMT_DOWN_PAYMENT -0.001313 -0.000063 0.267694 0.482776 0.301284 1.000000 0.482776 0.016776 0.001597 0.473935 0.016323 -0.115343 -0.024536 0.003533 0.031659 -0.001773 -0.013586 -0.000869 -0.031425 -0.030702 -0.042585
AMT_GOODS_PRICE 0.015293 0.000369 0.820895 0.999884 0.993087 0.482776 1.000000 -0.045267 -0.017100 -0.072479 0.110001 -0.199733 0.290422 -0.015842 0.672129 -0.024445 -0.021062 0.016883 0.211696 0.209296 0.243400
HOUR_APPR_PROCESS_START -0.002652 0.002842 -0.036201 -0.014415 -0.021039 0.016776 -0.045267 1.000000 0.005789 0.025930 -0.027172 -0.045720 -0.039962 0.015671 -0.055511 0.014321 -0.002797 -0.016567 -0.018018 -0.018254 -0.117318
NFLAG_LAST_APPL_IN_DAY -0.002828 0.000098 0.020639 0.004310 -0.025179 0.001597 -0.017100 0.005789 1.000000 0.004554 0.009604 0.024640 0.016555 0.000912 0.063347 -0.000409 -0.002288 -0.001981 -0.002277 -0.000744 -0.007124
RATE_DOWN_PAYMENT -0.004051 0.001158 -0.103878 -0.072479 -0.188128 0.473935 -0.072479 0.025930 0.004554 1.000000 -0.103373 -0.106143 -0.208742 -0.006489 -0.278875 -0.007969 -0.039178 -0.010934 -0.147562 -0.145461 -0.021633
RATE_INTEREST_PRIMARY 0.012969 0.033197 0.141823 0.110001 0.125106 0.016323 0.110001 -0.027172 0.009604 -0.103373 1.000000 -0.001937 0.014037 0.159182 -0.019030 NaN -0.017171 -0.000933 -0.010677 -0.011099 0.311938
RATE_INTEREST_PRIVILEGED -0.022312 -0.016757 -0.202335 -0.199733 -0.205158 -0.115343 -0.199733 -0.045720 0.024640 -0.106143 -0.001937 1.000000 0.631940 -0.066316 -0.057150 NaN 0.150904 0.030513 0.372214 0.378671 -0.067157
DAYS_DECISION 0.019100 -0.000637 0.279051 0.133660 0.133763 -0.024536 0.290422 -0.039962 0.016555 -0.208742 0.014037 0.631940 1.000000 -0.018382 0.246453 -0.012007 0.176711 0.089167 0.448549 0.400179 -0.028905
SELLERPLACE_AREA -0.001079 0.001265 -0.015027 -0.007649 -0.009567 0.003533 -0.015842 0.015671 0.000912 -0.006489 0.159182 -0.066316 -0.018382 1.000000 -0.010646 0.007401 -0.002166 -0.007510 -0.006291 -0.006675 -0.018280
CNT_PAYMENT 0.015589 0.000031 0.394535 0.680630 0.674278 0.031659 0.672129 -0.055511 0.063347 -0.278875 -0.019030 -0.057150 0.246453 -0.010646 1.000000 0.309900 -0.204907 -0.381013 0.088903 0.055121 0.320520
DAYS_FIRST_DRAWING -0.001478 -0.001329 0.052839 0.074544 -0.036813 -0.001773 -0.024445 0.014321 -0.000409 -0.007969 NaN NaN -0.012007 0.007401 0.309900 1.000000 0.004710 -0.803494 -0.257466 -0.396284 0.177652
DAYS_FIRST_DUE -0.000071 -0.000757 -0.053295 -0.049532 0.002881 -0.013586 -0.021062 -0.002797 -0.002288 -0.039178 -0.017171 0.150904 0.176711 -0.002166 -0.204907 0.004710 1.000000 0.513949 0.401838 0.323608 -0.119048
DAYS_LAST_DUE_1ST_VERSION 0.001222 0.000252 -0.068877 -0.084905 0.044031 -0.000869 0.016883 -0.016567 -0.001981 -0.010934 -0.000933 0.030513 0.089167 -0.007510 -0.381013 -0.803494 0.513949 1.000000 0.423462 0.493174 -0.221947
DAYS_LAST_DUE 0.001915 -0.000318 0.082659 0.172627 0.224829 -0.031425 0.211696 -0.018018 -0.002277 -0.147562 -0.010677 0.372214 0.448549 -0.006291 0.088903 -0.257466 0.401838 0.423462 1.000000 0.927990 0.012560
DAYS_TERMINATION 0.001781 -0.000020 0.068022 0.148618 0.214320 -0.030702 0.209296 -0.018254 -0.000744 -0.145461 -0.011099 0.378671 0.400179 -0.006675 0.055121 -0.396284 0.323608 0.493174 0.927990 1.000000 -0.003065
NFLAG_INSURED_ON_APPROVAL 0.003986 0.000876 0.283080 0.259219 0.263932 -0.042585 0.243400 -0.117318 -0.007124 -0.021633 0.311938 -0.067157 -0.028905 -0.018280 0.320520 0.177652 -0.119048 -0.221947 0.012560 -0.003065 1.000000
In [79]:
plt.subplots(figsize = (15,15))
sns.heatmap(prev_app.corr(), cmap = 'viridis')
Out[79]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb0ea65b9e8>
In [81]:
df_prev_app_corr = prev_app[['AMT_ANNUITY',
 'AMT_APPLICATION',
 'AMT_CREDIT',
 'AMT_DOWN_PAYMENT',
 'AMT_GOODS_PRICE',
 'RATE_DOWN_PAYMENT',
 'RATE_INTEREST_PRIMARY',
 'RATE_INTEREST_PRIVILEGED',
 'CNT_PAYMENT'
]].copy()

# Calculate correlations
corr = df_prev_app_corr.corr().abs()
# Heatmap
plt.figure(figsize=(15,8))
sns.heatmap(corr, annot=True, linewidths=.2, cmap="icefire");
In [ ]:
 

Previous application feature engineering

In [38]:
def cat_columns_value_dist(df,column_name):
    print("Column:",column_name)
       
    print(pd.DataFrame({'counts': df[column_name].value_counts(),
              'pct': round(df[column_name].value_counts() / len(df[column_name]),2)
             }))
In [39]:
cat_columns_value_dist(prev_app,'NAME_CONTRACT_TYPE')
Column: NAME_CONTRACT_TYPE
                 counts   pct
Cash loans       747553  0.45
Consumer loans   729151  0.44
Revolving loans  193164  0.12
XNA                 346  0.00
In [ ]:
 
In [40]:
cat_columns_value_dist(prev_app,'NAME_CASH_LOAN_PURPOSE')
Column: NAME_CASH_LOAN_PURPOSE
                                  counts   pct
XAP                               922661  0.55
XNA                               677918  0.41
Repairs                            23765  0.01
Other                              15608  0.01
Urgent needs                        8412  0.01
Buying a used car                   2888  0.00
Building a house or an annex        2693  0.00
Everyday expenses                   2416  0.00
Medicine                            2174  0.00
Payments on other loans             1931  0.00
Education                           1573  0.00
Journey                             1239  0.00
Purchase of electronic equipment    1061  0.00
Buying a new car                    1012  0.00
Wedding / gift / holiday             962  0.00
Buying a home                        865  0.00
Car repairs                          797  0.00
Furniture                            749  0.00
Buying a holiday home / land         533  0.00
Business development                 426  0.00
Gasification / water supply          300  0.00
Buying a garage                      136  0.00
Hobby                                 55  0.00
Money for a third person              25  0.00
Refusal to name the goal              15  0.00
In [41]:
cat_columns_value_dist(prev_app,'NAME_PAYMENT_TYPE')
Column: NAME_PAYMENT_TYPE
                                            counts   pct
Cash through the bank                      1033552  0.62
XNA                                         627384  0.38
Non-cash from your account                    8193  0.00
Cashless from the account of the employer     1085  0.00
In [42]:
cat_columns_value_dist(prev_app,'NAME_TYPE_SUITE')
Column: NAME_TYPE_SUITE
                 counts   pct
Unaccompanied    508970  0.30
Family           213263  0.13
Spouse, partner   67069  0.04
Children          31566  0.02
Other_B           17624  0.01
Other_A            9077  0.01
Group of people    2240  0.00
In [43]:
cat_columns_value_dist(prev_app,'NAME_GOODS_CATEGORY')
Column: NAME_GOODS_CATEGORY
                          counts   pct
XNA                       950809  0.57
Mobile                    224708  0.13
Consumer Electronics      121576  0.07
Computers                 105769  0.06
Audio/Video                99441  0.06
Furniture                  53656  0.03
Photo / Cinema Equipment   25021  0.01
Construction Materials     24995  0.01
Clothing and Accessories   23554  0.01
Auto Accessories            7381  0.00
Jewelry                     6290  0.00
Homewares                   5023  0.00
Medical Supplies            3843  0.00
Vehicles                    3370  0.00
Sport and Leisure           2981  0.00
Gardening                   2668  0.00
Other                       2554  0.00
Office Appliances           2333  0.00
Tourism                     1659  0.00
Medicine                    1550  0.00
Direct Sales                 446  0.00
Fitness                      209  0.00
Additional Service           128  0.00
Education                    107  0.00
Weapon                        77  0.00
Insurance                     64  0.00
House Construction             1  0.00
Animals                        1  0.00
In [44]:
cat_columns_value_dist(prev_app,'NAME_PORTFOLIO')
Column: NAME_PORTFOLIO
       counts   pct
POS    691011  0.41
Cash   461563  0.28
XNA    372230  0.22
Cards  144985  0.09
Cars      425  0.00
In [45]:
cat_columns_value_dist(prev_app,'NAME_PRODUCT_TYPE')
Column: NAME_PRODUCT_TYPE
          counts   pct
XNA      1063666  0.64
x-sell    456287  0.27
walk-in   150261  0.09
In [49]:
cat_columns_value_dist(prev_app,'PRODUCT_COMBINATION')
Column: PRODUCT_COMBINATION
                                counts   pct
Cash                            285990  0.17
POS household with interest     263622  0.16
POS mobile with interest        220670  0.13
Cash X-Sell: middle             143883  0.09
Cash X-Sell: low                130248  0.08
Card Street                     112582  0.07
POS industry with interest       98833  0.06
POS household without interest   82908  0.05
Card X-Sell                      80582  0.05
Cash Street: high                59639  0.04
Cash X-Sell: high                59301  0.04
Cash Street: middle              34658  0.02
Cash Street: low                 33834  0.02
POS mobile without interest      24082  0.01
POS other with interest          23879  0.01
POS industry without interest    12602  0.01
POS others without interest       2555  0.00
In [34]:
prev_app['NAME_CONTRACT_TYPE_CAT'] = prev_app['NAME_CONTRACT_TYPE'].apply(lambda x : 'CASH_LOAD' if x =='Cash loans' else ('CONSUMER_LOAN' if x == 'Consumer loans' else 'OTHER'))
In [36]:
prev_app['NAME_CASH_LOAN_PURPOSE_CAT'] = prev_app['NAME_CASH_LOAN_PURPOSE'].apply(lambda x : 'XAP' if x =='XAP' else ('XNA' if x == 'XNA' else 'OTHER'))
In [38]:
prev_app['NAME_PAYMENT_TYPE_CAT'] = prev_app['NAME_PAYMENT_TYPE'].apply(lambda x : 'CASH_THROUGH_BANK' if x =='Cash through the bank' else ('XNA' if x == 'XNA' else 'OTHER'))
In [40]:
prev_app['NAME_TYPE_SUITE_CAT'] = prev_app['NAME_TYPE_SUITE'].apply(lambda x : 'Unaccompanied' if x =='Unaccompanied' else 'accompanied')
In [43]:
prev_app['NAME_GOODS_CATEGORY_CAT'] = prev_app['NAME_GOODS_CATEGORY'].apply(lambda x : 'XNA' if x =='XNA' else ('Mobile' if x == 'Mobile' else 'OTHER'))
In [45]:
prev_app['NAME_PORTFOLIO_CAT'] = prev_app['NAME_PORTFOLIO'].apply(lambda x : 'POS' if x =='POS' else ('Cash' if x == 'Cash' else ('XNA' if x == 'XNA' else 'OTHER')))
In [47]:
prev_app['NAME_PRODUCT_TYPE_CAT'] = prev_app['NAME_PRODUCT_TYPE'].apply(lambda x : 'XNA' if x =='XNA' else ('XSELL' if x == 'x-sell' else 'OTHER'))
In [ ]:
 
In [55]:
prev_app, prev_app_cat = one_hot_encoder(prev_app, nan_as_category=False, columns = ['NAME_CONTRACT_TYPE_CAT','NAME_CASH_LOAN_PURPOSE_CAT',
       'NAME_PAYMENT_TYPE_CAT', 'NAME_TYPE_SUITE_CAT',
       'NAME_GOODS_CATEGORY_CAT', 'NAME_PORTFOLIO_CAT',
       'NAME_PRODUCT_TYPE_CAT','NAME_CONTRACT_STATUS'])
In [56]:
prev_app.head()
Out[56]:
SK_ID_PREV SK_ID_CURR NAME_CONTRACT_TYPE AMT_ANNUITY AMT_APPLICATION AMT_CREDIT AMT_DOWN_PAYMENT AMT_GOODS_PRICE WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START ... NAME_PORTFOLIO_CAT_OTHER NAME_PORTFOLIO_CAT_POS NAME_PORTFOLIO_CAT_XNA NAME_PRODUCT_TYPE_CAT_OTHER NAME_PRODUCT_TYPE_CAT_XNA NAME_PRODUCT_TYPE_CAT_XSELL NAME_CONTRACT_STATUS_Approved NAME_CONTRACT_STATUS_Canceled NAME_CONTRACT_STATUS_Refused NAME_CONTRACT_STATUS_Unused offer
0 2030495 271877 Consumer loans 1730.430 17145.0 17145.0 0.0 17145.0 SATURDAY 15 ... 0 1 0 0 1 0 1 0 0 0
1 2802425 108129 Cash loans 25188.615 607500.0 679671.0 NaN 607500.0 THURSDAY 11 ... 0 0 0 0 0 1 1 0 0 0
2 2523466 122040 Cash loans 15060.735 112500.0 136444.5 NaN 112500.0 TUESDAY 11 ... 0 0 0 0 0 1 1 0 0 0
3 2819243 176158 Cash loans 47041.335 450000.0 470790.0 NaN 450000.0 MONDAY 7 ... 0 0 0 0 0 1 1 0 0 0
4 1784265 202054 Cash loans 31924.395 337500.0 404055.0 NaN 337500.0 THURSDAY 9 ... 0 0 0 1 0 0 0 0 1 0

5 rows × 61 columns

In [ ]:
 

Generating Previous application features

In [122]:
def prevapp_features(prev_app):
    prev_app['NAME_CONTRACT_TYPE_CAT'] = prev_app['NAME_CONTRACT_TYPE'].apply(lambda x : 'CASH_LOAD' if x =='Cash loans' else ('CONSUMER_LOAN' if x == 'Consumer loans' else 'OTHER'))
    prev_app['NAME_CASH_LOAN_PURPOSE_CAT'] = prev_app['NAME_CASH_LOAN_PURPOSE'].apply(lambda x : 'XAP' if x =='XAP' else ('XNA' if x == 'XNA' else 'OTHER'))
    prev_app['NAME_PAYMENT_TYPE_CAT'] = prev_app['NAME_PAYMENT_TYPE'].apply(lambda x : 'CASH_THROUGH_BANK' if x =='Cash through the bank' else ('XNA' if x == 'XNA' else 'OTHER'))
    prev_app['NAME_TYPE_SUITE_CAT'] = prev_app['NAME_TYPE_SUITE'].apply(lambda x : 'Unaccompanied' if x =='Unaccompanied' else 'accompanied')
    prev_app['NAME_GOODS_CATEGORY_CAT'] = prev_app['NAME_GOODS_CATEGORY'].apply(lambda x : 'XNA' if x =='XNA' else ('Mobile' if x == 'Mobile' else 'OTHER'))
    prev_app['NAME_PORTFOLIO_CAT'] = prev_app['NAME_PORTFOLIO'].apply(lambda x : 'POS' if x =='POS' else ('Cash' if x == 'Cash' else ('XNA' if x == 'XNA' else 'OTHER')))
    prev_app['NAME_PRODUCT_TYPE_CAT'] = prev_app['NAME_PRODUCT_TYPE'].apply(lambda x : 'XNA' if x =='XNA' else ('XSELL' if x == 'x-sell' else 'OTHER'))
    
    prev_app, prev_app_cat = one_hot_encoder(prev_app, nan_as_category=False, columns = ['NAME_CONTRACT_TYPE_CAT','NAME_CASH_LOAN_PURPOSE_CAT',
       'NAME_PAYMENT_TYPE_CAT', 'NAME_TYPE_SUITE_CAT',
       'NAME_GOODS_CATEGORY_CAT', 'NAME_PORTFOLIO_CAT',
       'NAME_PRODUCT_TYPE_CAT','NAME_CONTRACT_STATUS'])
    
    num_aggregations = {
    'SK_ID_PREV' : ['count'],
    'NAME_CONTRACT_TYPE'  : ['nunique'],
    'AMT_ANNUITY': [ 'mean', 'sum'],
    'AMT_APPLICATION': [ 'mean', 'sum'],
    'AMT_CREDIT': [ 'mean', 'sum'],
    'AMT_DOWN_PAYMENT': [ 'mean', 'sum'],
    'AMT_GOODS_PRICE': [ 'mean', 'sum'],
    'CNT_PAYMENT': [ 'mean', 'sum','min','max'],
    'NAME_CONTRACT_STATUS_Approved': ['sum'],
    'NAME_CONTRACT_STATUS_Canceled': ['sum'],
    'NAME_CONTRACT_STATUS_Refused': ['sum'],
    'NAME_CONTRACT_STATUS_Unused offer': ['sum'],
     'NAME_CONTRACT_TYPE_CAT_CASH_LOAD' :  ['sum'] ,
     'NAME_CONTRACT_TYPE_CAT_CONSUMER_LOAN' :  ['sum'] ,
     'NAME_CONTRACT_TYPE_CAT_OTHER' :  ['sum'] ,
     'NAME_CASH_LOAN_PURPOSE_CAT_OTHER' :  ['sum'] ,
     'NAME_CASH_LOAN_PURPOSE_CAT_XAP' :  ['sum'] ,
     'NAME_CASH_LOAN_PURPOSE_CAT_XNA' :  ['sum'] ,
     'NAME_PAYMENT_TYPE_CAT_CASH_THROUGH_BANK' :  ['sum'] ,
     'NAME_PAYMENT_TYPE_CAT_OTHER' :  ['sum'] ,
     'NAME_PAYMENT_TYPE_CAT_XNA' :  ['sum'] ,
     'NAME_TYPE_SUITE_CAT_Unaccompanied' :  ['sum'] ,
     'NAME_TYPE_SUITE_CAT_accompanied' :  ['sum'] ,
     'NAME_GOODS_CATEGORY_CAT_Mobile' :  ['sum'] ,
     'NAME_GOODS_CATEGORY_CAT_OTHER' :  ['sum'] ,
     'NAME_GOODS_CATEGORY_CAT_XNA' :  ['sum'] ,
     'NAME_PORTFOLIO_CAT_Cash' :  ['sum'] ,
     'NAME_PORTFOLIO_CAT_OTHER' :  ['sum'] ,
     'NAME_PORTFOLIO_CAT_POS' :  ['sum'] ,
     'NAME_PORTFOLIO_CAT_XNA' :  ['sum'] ,
     'NAME_PRODUCT_TYPE_CAT_OTHER' :  ['sum'] ,
     'NAME_PRODUCT_TYPE_CAT_XNA' :  ['sum'] ,
     'NAME_PRODUCT_TYPE_CAT_XSELL' :  ['sum'] ,
     'NAME_CONTRACT_STATUS_Approved' :  ['sum'] ,
     'NAME_CONTRACT_STATUS_Canceled' :  ['sum'] ,
     'NAME_CONTRACT_STATUS_Refused' :  ['sum'] ,
     'NAME_CONTRACT_STATUS_Unused offer' :  ['sum']
    }
    prev_app_agg = prev_app.groupby('SK_ID_CURR').agg({**num_aggregations})
    prev_app_agg['NAME_CONTRACT_STATUS_REFUSED_RATIO'] =  prev_app_agg['NAME_CONTRACT_STATUS_Refused']/(prev_app_agg['NAME_CONTRACT_STATUS_Canceled']+prev_app_agg['NAME_CONTRACT_STATUS_Approved']+prev_app_agg['NAME_CONTRACT_STATUS_Canceled']+prev_app_agg['NAME_CONTRACT_STATUS_Unused offer'])
    
    prev_app_agg.columns = [e[0] + '_' + e[1].upper() for e in prev_app_agg.columns.to_list()]
    prev_app_agg.columns = ['PREV_APP_' + e for e in prev_app_agg.columns.to_list()]
    prev_app_agg.reset_index(inplace=True)
    prev_app_agg.to_csv("../Project/data/processed/previous_application_processed.csv", index = False)
    
    return prev_app_agg
In [123]:
df_prev_app=prevapp_features(prev_app)
In [134]:
app_train_i=app_train[['SK_ID_CURR', 'TARGET']]
In [137]:
app_train_prev_app_merged = app_train_i.merge(df_prev_app.reset_index(), 
                                            left_on='SK_ID_CURR', right_on='SK_ID_CURR', 
                                            how='left', validate='one_to_one')
In [140]:
corr_matrix = app_train_prev_app_merged.corr()
In [141]:
corr_matrix["TARGET"].sort_values(ascending=False)
Out[141]:
TARGET                                                  1.000000
PREV_APP_NAME_CONTRACT_STATUS_REFUSED_RATIO_            0.068409
PREV_APP_NAME_CONTRACT_STATUS_Refused_SUM               0.064469
PREV_APP_NAME_PRODUCT_TYPE_CAT_OTHER_SUM                0.062628
PREV_APP_NAME_CASH_LOAN_PURPOSE_CAT_OTHER_SUM           0.049403
PREV_APP_NAME_CONTRACT_TYPE_CAT_OTHER_SUM               0.045997
PREV_APP_NAME_PORTFOLIO_CAT_OTHER_SUM                   0.041221
PREV_APP_NAME_PAYMENT_TYPE_CAT_XNA_SUM                  0.039469
PREV_APP_NAME_GOODS_CATEGORY_CAT_XNA_SUM                0.032688
PREV_APP_NAME_PORTFOLIO_CAT_XNA_SUM                     0.032270
PREV_APP_CNT_PAYMENT_MAX                                0.029439
PREV_APP_CNT_PAYMENT_MEAN                               0.027743
PREV_APP_CNT_PAYMENT_SUM                                0.027716
PREV_APP_NAME_TYPE_SUITE_CAT_accompanied_SUM            0.025195
PREV_APP_NAME_CONTRACT_TYPE_NUNIQUE                     0.024550
PREV_APP_NAME_CONTRACT_TYPE_CAT_CASH_LOAD_SUM           0.022676
PREV_APP_SK_ID_PREV_COUNT                               0.019762
PREV_APP_NAME_CONTRACT_STATUS_Canceled_SUM              0.019031
PREV_APP_NAME_PORTFOLIO_CAT_Cash_SUM                    0.015628
PREV_APP_NAME_CASH_LOAN_PURPOSE_CAT_XNA_SUM             0.012536
PREV_APP_NAME_GOODS_CATEGORY_CAT_Mobile_SUM             0.010935
PREV_APP_AMT_CREDIT_SUM                                 0.008308
PREV_APP_NAME_PRODUCT_TYPE_CAT_XNA_SUM                  0.007036
PREV_APP_AMT_GOODS_PRICE_SUM                            0.004662
PREV_APP_AMT_APPLICATION_SUM                            0.004607
PREV_APP_NAME_CASH_LOAN_PURPOSE_CAT_XAP_SUM             0.004065
PREV_APP_NAME_TYPE_SUITE_CAT_Unaccompanied_SUM          0.002582
PREV_APP_NAME_CONTRACT_STATUS_Unused offer_SUM          0.000517
PREV_APP_NAME_PAYMENT_TYPE_CAT_OTHER_SUM                0.000014
SK_ID_CURR                                             -0.002108
index                                                  -0.002277
PREV_APP_NAME_PRODUCT_TYPE_CAT_XSELL_SUM               -0.002541
PREV_APP_NAME_PAYMENT_TYPE_CAT_CASH_THROUGH_BANK_SUM   -0.004380
PREV_APP_AMT_ANNUITY_SUM                               -0.006736
PREV_APP_CNT_PAYMENT_MIN                               -0.010789
PREV_APP_AMT_GOODS_PRICE_MEAN                          -0.015847
PREV_APP_AMT_CREDIT_MEAN                               -0.016114
PREV_APP_NAME_CONTRACT_TYPE_CAT_CONSUMER_LOAN_SUM      -0.020860
PREV_APP_AMT_APPLICATION_MEAN                          -0.021803
PREV_APP_NAME_PORTFOLIO_CAT_POS_SUM                    -0.024506
PREV_APP_AMT_DOWN_PAYMENT_MEAN                         -0.024624
PREV_APP_AMT_DOWN_PAYMENT_SUM                          -0.026995
PREV_APP_NAME_CONTRACT_STATUS_Approved_SUM             -0.031553
PREV_APP_NAME_GOODS_CATEGORY_CAT_OTHER_SUM             -0.033318
PREV_APP_AMT_ANNUITY_MEAN                              -0.034871
Name: TARGET, dtype: float64
In [138]:
app_train_prev_app_merged.head()
Out[138]:
SK_ID_CURR TARGET index PREV_APP_SK_ID_PREV_COUNT PREV_APP_NAME_CONTRACT_TYPE_NUNIQUE PREV_APP_AMT_ANNUITY_MEAN PREV_APP_AMT_ANNUITY_SUM PREV_APP_AMT_APPLICATION_MEAN PREV_APP_AMT_APPLICATION_SUM PREV_APP_AMT_CREDIT_MEAN PREV_APP_AMT_CREDIT_SUM PREV_APP_AMT_DOWN_PAYMENT_MEAN PREV_APP_AMT_DOWN_PAYMENT_SUM PREV_APP_AMT_GOODS_PRICE_MEAN PREV_APP_AMT_GOODS_PRICE_SUM PREV_APP_CNT_PAYMENT_MEAN PREV_APP_CNT_PAYMENT_SUM PREV_APP_CNT_PAYMENT_MIN PREV_APP_CNT_PAYMENT_MAX PREV_APP_NAME_CONTRACT_STATUS_Approved_SUM PREV_APP_NAME_CONTRACT_STATUS_Canceled_SUM PREV_APP_NAME_CONTRACT_STATUS_Refused_SUM PREV_APP_NAME_CONTRACT_STATUS_Unused offer_SUM PREV_APP_NAME_CONTRACT_TYPE_CAT_CASH_LOAD_SUM PREV_APP_NAME_CONTRACT_TYPE_CAT_CONSUMER_LOAN_SUM PREV_APP_NAME_CONTRACT_TYPE_CAT_OTHER_SUM PREV_APP_NAME_CASH_LOAN_PURPOSE_CAT_OTHER_SUM PREV_APP_NAME_CASH_LOAN_PURPOSE_CAT_XAP_SUM PREV_APP_NAME_CASH_LOAN_PURPOSE_CAT_XNA_SUM PREV_APP_NAME_PAYMENT_TYPE_CAT_CASH_THROUGH_BANK_SUM PREV_APP_NAME_PAYMENT_TYPE_CAT_OTHER_SUM PREV_APP_NAME_PAYMENT_TYPE_CAT_XNA_SUM PREV_APP_NAME_TYPE_SUITE_CAT_Unaccompanied_SUM PREV_APP_NAME_TYPE_SUITE_CAT_accompanied_SUM PREV_APP_NAME_GOODS_CATEGORY_CAT_Mobile_SUM PREV_APP_NAME_GOODS_CATEGORY_CAT_OTHER_SUM PREV_APP_NAME_GOODS_CATEGORY_CAT_XNA_SUM PREV_APP_NAME_PORTFOLIO_CAT_Cash_SUM PREV_APP_NAME_PORTFOLIO_CAT_OTHER_SUM PREV_APP_NAME_PORTFOLIO_CAT_POS_SUM PREV_APP_NAME_PORTFOLIO_CAT_XNA_SUM PREV_APP_NAME_PRODUCT_TYPE_CAT_OTHER_SUM PREV_APP_NAME_PRODUCT_TYPE_CAT_XNA_SUM PREV_APP_NAME_PRODUCT_TYPE_CAT_XSELL_SUM PREV_APP_NAME_CONTRACT_STATUS_REFUSED_RATIO_
0 100002 1 1.0 1.0 1.0 9251.775 9251.775 179055.00 179055.00 179055.00 179055.0 0.00 0.00 179055.00 179055.00 24.000000 24.0 24.0 24.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.000000
1 100003 0 2.0 3.0 2.0 56553.990 169661.970 435436.50 1306309.50 484191.00 1452573.0 3442.50 6885.00 435436.50 1306309.50 10.000000 30.0 6.0 12.0 3.0 0.0 0.0 0.0 1.0 2.0 0.0 0.0 2.0 1.0 2.0 0.0 1.0 1.0 2.0 0.0 2.0 1.0 1.0 0.0 2.0 0.0 0.0 2.0 1.0 0.000000
2 100004 0 3.0 1.0 1.0 5357.250 5357.250 24282.00 24282.00 20106.00 20106.0 4860.00 4860.00 24282.00 24282.00 4.000000 4.0 4.0 4.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.000000
3 100006 0 5.0 9.0 3.0 23651.175 141907.050 272203.26 2449829.34 291695.50 2625259.5 34840.17 69680.34 408304.89 2449829.34 23.000000 138.0 0.0 48.0 5.0 3.0 1.0 0.0 5.0 2.0 2.0 0.0 4.0 5.0 4.0 0.0 5.0 3.0 6.0 0.0 2.0 7.0 3.0 1.0 2.0 3.0 0.0 5.0 4.0 0.090909
4 100007 0 6.0 6.0 2.0 12278.805 73672.830 150530.25 903181.50 166638.75 999832.5 3390.75 6781.50 150530.25 903181.50 20.666667 124.0 10.0 48.0 6.0 0.0 0.0 0.0 4.0 2.0 0.0 0.0 2.0 4.0 5.0 0.0 1.0 2.0 4.0 0.0 2.0 4.0 4.0 0.0 2.0 0.0 1.0 2.0 3.0 0.000000
In [139]:
app_train_numeric = app_train_prev_app_merged[ app_train_prev_app_merged.dtypes[app_train_prev_app_merged.dtypes == 'float64'].index]
numeric_index = app_train_numeric.isna().sum()[app_train_numeric.isna().sum()/len(app_train_numeric) <0.5].index[:10]
app_train_numeric =  app_train_numeric[numeric_index]

app_train_numeric['TARGET'] = app_train_prev_app_merged['TARGET']
g = app_train_numeric.groupby('TARGET')
app_train_numeric = g.apply(lambda x: x.sample(g.size().min()).reset_index(drop=True))
sns.pairplot(app_train_numeric, hue= 'TARGET')
Out[139]:
<seaborn.axisgrid.PairGrid at 0x7fb109550358>

Installment data

Installment data description

Installments_payments dataset has repayment history for the previously disbursed credits in Home Credit related to the loans in application train dataset. Installment data will be usefull as it shows behavios of applicant in perpective of historical payment trend.

One row is equivalent to one payment of one installment. This data has corresponding foreign key in previos application data. Some of features in previous application dataset are

  • DAYS_INSTALMENT- When the installment of previous credit was supposed to be paid
  • DAYS_ENTRY_PAYMENT-When was the installments of previous credit paid actually
  • AMT_INSTALMENT-What was the prescribed installment amount of previous credit on this installment
  • AMT_PAYMENT-What the client actually paid on previous credit on this installment
In [84]:
file_list = ['application_test.csv','installments_payments.csv']


#app_test, app_train, installment, bureau_balance, pos_cash, bureau, prev_app, cc_bal = read_files(file_list, path, print_details=False)
app_test,inst_paym= read_files(file_list, path, print_details=False)
1it [00:00,  1.65it/s]
0it [00:00, ?it/s]
application_test.csv ... Read Completed
****************************************
28it [00:20,  1.35it/s]
installments_payments.csv ... Read Completed
****************************************

In [38]:
inst_paym = reduce_memory(inst_paym)
memory usage is  830.0 MB
end memory usage is  311.0 MB

Installment data EDA

In [39]:
inst_paym.shape
Out[39]:
(13605401, 8)
In [15]:
inst_paym.dtypes.value_counts()
Out[15]:
float16    3
int32      2
float32    2
int16      1
dtype: int64
In [16]:
inst_paym.select_dtypes('int').apply(pd.Series.nunique, axis = 0)
Out[16]:
Series([], dtype: float64)
In [17]:
inst_paym.describe()
Out[17]:
SK_ID_PREV SK_ID_CURR NUM_INSTALMENT_VERSION NUM_INSTALMENT_NUMBER DAYS_INSTALMENT DAYS_ENTRY_PAYMENT AMT_INSTALMENT AMT_PAYMENT
count 1.360540e+07 1.360540e+07 13605401.0 1.360540e+07 13605401.0 13602496.0 1.360540e+07 1.360250e+07
mean 1.903365e+06 2.784449e+05 NaN 1.887090e+01 NaN NaN 1.705092e+04 1.723821e+04
std 5.362029e+05 1.027183e+05 0.0 2.666407e+01 NaN NaN 5.057025e+04 5.473578e+04
min 1.000001e+06 1.000010e+05 0.0 1.000000e+00 -2922.0 -4920.0 0.000000e+00 0.000000e+00
25% 1.434191e+06 1.896390e+05 0.0 4.000000e+00 -1654.0 -1662.0 4.226085e+03 3.398265e+03
50% 1.896520e+06 2.786850e+05 1.0 8.000000e+00 -818.0 -827.0 8.884080e+03 8.125515e+03
75% 2.369094e+06 3.675300e+05 1.0 1.900000e+01 -361.0 -370.0 1.671021e+04 1.610842e+04
max 2.843499e+06 4.562550e+05 178.0 2.770000e+02 -1.0 -1.0 3.771488e+06 3.771488e+06
In [18]:
missingdata(inst_paym,-1)

Number of columns with over -1 percent missing values 
-------------------
8
Out[18]:
Total Percent
AMT_PAYMENT 2905 0.021352
DAYS_ENTRY_PAYMENT 2905 0.021352
AMT_INSTALMENT 0 0.000000
DAYS_INSTALMENT 0 0.000000
NUM_INSTALMENT_NUMBER 0 0.000000
NUM_INSTALMENT_VERSION 0 0.000000
SK_ID_CURR 0 0.000000
SK_ID_PREV 0 0.000000
In [85]:
corr_matrix = inst_paym.corr()
In [86]:
corr_matrix
Out[86]:
SK_ID_PREV SK_ID_CURR NUM_INSTALMENT_VERSION NUM_INSTALMENT_NUMBER DAYS_INSTALMENT DAYS_ENTRY_PAYMENT AMT_INSTALMENT AMT_PAYMENT
SK_ID_PREV 1.000000 0.002132 0.000685 -0.002095 0.003748 0.003734 0.002042 0.001887
SK_ID_CURR 0.002132 1.000000 0.000480 -0.000548 0.001191 0.001215 -0.000226 -0.000124
NUM_INSTALMENT_VERSION 0.000685 0.000480 1.000000 -0.323414 0.130244 0.128124 0.168109 0.177176
NUM_INSTALMENT_NUMBER -0.002095 -0.000548 -0.323414 1.000000 0.090286 0.094305 -0.089640 -0.087664
DAYS_INSTALMENT 0.003748 0.001191 0.130244 0.090286 1.000000 0.999491 0.125985 0.127018
DAYS_ENTRY_PAYMENT 0.003734 0.001215 0.128124 0.094305 0.999491 1.000000 0.125555 0.126602
AMT_INSTALMENT 0.002042 -0.000226 0.168109 -0.089640 0.125985 0.125555 1.000000 0.937191
AMT_PAYMENT 0.001887 -0.000124 0.177176 -0.087664 0.127018 0.126602 0.937191 1.000000
In [87]:
plt.subplots(figsize = (15,15))
sns.heatmap(inst_paym.corr(), cmap = 'viridis')
Out[87]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb109722ba8>
In [88]:
inst_paym.columns.tolist()
Out[88]:
['SK_ID_PREV',
 'SK_ID_CURR',
 'NUM_INSTALMENT_VERSION',
 'NUM_INSTALMENT_NUMBER',
 'DAYS_INSTALMENT',
 'DAYS_ENTRY_PAYMENT',
 'AMT_INSTALMENT',
 'AMT_PAYMENT']
In [89]:
df_inst_paym_corr = inst_paym[['DAYS_INSTALMENT',
 'DAYS_ENTRY_PAYMENT',
 'AMT_INSTALMENT',
 'AMT_PAYMENT'
]].copy()

# Calculate correlations
corr = df_inst_paym_corr.corr().abs()
# Heatmap
plt.figure(figsize=(15,8))
sns.heatmap(corr, annot=True, linewidths=.2, cmap="icefire");

Installment feature engineering

Some of features categories to evalue on Installemnt data

  • Number of installments for previous loan
  • Missed or late payment
  • Partial payment compare to actual installment
  • Ratio of missed /partial payments to total payments

Late payments/days difference between actual vs expected date of payment

In [ ]:
inst_paym['INSTALMENT_ACTUAL_DAYS_DIFF'] =  inst_paym['DAYS_ENTRY_PAYMENT']- inst_paym['DAYS_INSTALMENT'] 
inst_paym['LATE_INSTAL_Binary'] = inst_paym['INSTALMENT_ACTUAL_DAYS_DIFF'].apply(lambda x : 0 if x <= 0 else 1)

Partial payments/amount difference between actual vs expected installment amount

In [18]:
inst_paym['AMT_INSTAL_ACTUAL_DIFF'] =  inst_paym['AMT_INSTALMENT']- inst_paym['AMT_PAYMENT']
inst_paym['AMT_INSTAL_ACTUAL_DIFF_Binary'] = inst_paym['AMT_INSTAL_ACTUAL_DIFF'].apply(lambda x : 0 if x <= 0 else 1)
In [21]:
# OHE for late/partial payment
inst_paym, inst_paym_cat = one_hot_encoder(inst_paym, nan_as_category=False, columns = ['LATE_INSTAL_Binary','AMT_INSTAL_ACTUAL_DIFF_Binary'])
In [ ]:
 
In [96]:
def instasllmentfeatures(inst_paym):
    inst_paym['INSTALMENT_ACTUAL_DAYS_DIFF'] =  inst_paym['DAYS_ENTRY_PAYMENT']- inst_paym['DAYS_INSTALMENT'] 
    inst_paym['LATE_INSTAL_Binary'] = inst_paym['INSTALMENT_ACTUAL_DAYS_DIFF'].apply(lambda x : 0 if x <= 0 else 1)
    inst_paym['AMT_INSTAL_ACTUAL_DIFF'] =  inst_paym['AMT_INSTALMENT']- inst_paym['AMT_PAYMENT']
    inst_paym['AMT_INSTAL_ACTUAL_DIFF_Binary'] = inst_paym['AMT_INSTAL_ACTUAL_DIFF'].apply(lambda x : 0 if x <= 0 else 1)
    inst_paym, inst_paym_cat = one_hot_encoder(inst_paym, nan_as_category=False, columns = ['LATE_INSTAL_Binary','AMT_INSTAL_ACTUAL_DIFF_Binary'])

    num_aggregations = {
    'SK_ID_PREV' : ['count'],
    'AMT_INSTALMENT'  : ['mean', 'sum'],
    'AMT_PAYMENT': [ 'mean', 'sum'],
    'INSTALMENT_ACTUAL_DAYS_DIFF': [ 'mean', 'sum'],
    'AMT_INSTAL_ACTUAL_DIFF': [ 'mean', 'sum'],
    'LATE_INSTAL_Binary_0': ['sum'],
    'LATE_INSTAL_Binary_1': [ 'sum'],
    'AMT_INSTAL_ACTUAL_DIFF_Binary_0': ['sum'],
    'AMT_INSTAL_ACTUAL_DIFF_Binary_1': ['sum']
    }
    inst_paym_agg = inst_paym.groupby('SK_ID_CURR').agg({**num_aggregations})
    inst_paym_agg['LATE_INSTAL_RATIO'] =  inst_paym_agg['LATE_INSTAL_Binary_1']/(inst_paym_agg['LATE_INSTAL_Binary_1']+inst_paym_agg['LATE_INSTAL_Binary_0'])
    inst_paym_agg['AMT_PARTIAL_INSTAL_RATIO'] =  inst_paym_agg['AMT_INSTAL_ACTUAL_DIFF_Binary_1']/(inst_paym_agg['AMT_INSTAL_ACTUAL_DIFF_Binary_0']+inst_paym_agg['AMT_INSTAL_ACTUAL_DIFF_Binary_1'])

    inst_paym_agg.columns = [e[0] + '_' + e[1].upper() for e in inst_paym_agg.columns.to_list()]
    inst_paym_agg.columns = ['INST_PAY_' + e for e in inst_paym_agg.columns.to_list()]
    inst_paym_agg.reset_index(inplace=True)
    inst_paym_agg.to_csv("../Project/data/processed/installments_payments_processed.csv", index = False)
    return inst_paym_agg
    
In [97]:
inst_paym_agg= instasllmentfeatures(inst_paym)
In [142]:
app_train_i=app_train[['SK_ID_CURR', 'TARGET']]
In [143]:
app_train_inst_merged = app_train_i.merge(inst_paym_agg.reset_index(), 
                                            left_on='SK_ID_CURR', right_on='SK_ID_CURR', 
                                            how='left', validate='one_to_one')
In [145]:
corr_matrix = app_train_inst_merged.corr()
In [146]:
corr_matrix["TARGET"].sort_values(ascending=False)
Out[146]:
TARGET                                          1.000000
INST_PAY_LATE_INSTAL_RATIO_                     0.070639
INST_PAY_AMT_PARTIAL_INSTAL_RATIO_              0.063143
INST_PAY_LATE_INSTAL_Binary_1_SUM               0.031106
INST_PAY_INSTALMENT_ACTUAL_DAYS_DIFF_SUM        0.030351
INST_PAY_AMT_INSTAL_ACTUAL_DIFF_MEAN            0.029339
INST_PAY_AMT_INSTAL_ACTUAL_DIFF_Binary_1_SUM    0.028849
INST_PAY_AMT_INSTAL_ACTUAL_DIFF_SUM             0.027326
INST_PAY_INSTALMENT_ACTUAL_DAYS_DIFF_MEAN       0.020870
SK_ID_CURR                                     -0.002108
index                                          -0.002361
INST_PAY_AMT_INSTALMENT_MEAN                   -0.018409
INST_PAY_AMT_INSTALMENT_SUM                    -0.019811
INST_PAY_SK_ID_PREV_COUNT                      -0.021096
INST_PAY_AMT_PAYMENT_MEAN                      -0.023169
INST_PAY_AMT_PAYMENT_SUM                       -0.024375
INST_PAY_LATE_INSTAL_Binary_0_SUM              -0.028025
INST_PAY_AMT_INSTAL_ACTUAL_DIFF_Binary_0_SUM   -0.028911
Name: TARGET, dtype: float64
In [147]:
app_train_inst_merged.head()
Out[147]:
SK_ID_CURR TARGET index INST_PAY_SK_ID_PREV_COUNT INST_PAY_AMT_INSTALMENT_MEAN INST_PAY_AMT_INSTALMENT_SUM INST_PAY_AMT_PAYMENT_MEAN INST_PAY_AMT_PAYMENT_SUM INST_PAY_INSTALMENT_ACTUAL_DAYS_DIFF_MEAN INST_PAY_INSTALMENT_ACTUAL_DAYS_DIFF_SUM INST_PAY_AMT_INSTAL_ACTUAL_DIFF_MEAN INST_PAY_AMT_INSTAL_ACTUAL_DIFF_SUM INST_PAY_LATE_INSTAL_Binary_0_SUM INST_PAY_LATE_INSTAL_Binary_1_SUM INST_PAY_AMT_INSTAL_ACTUAL_DIFF_Binary_0_SUM INST_PAY_AMT_INSTAL_ACTUAL_DIFF_Binary_1_SUM INST_PAY_LATE_INSTAL_RATIO_ INST_PAY_AMT_PARTIAL_INSTAL_RATIO_
0 100002 1 1.0 19.0 11559.247105 219625.695 11559.247105 219625.695 -20.421053 -388.0 0.000000 0.000 19.0 0.0 19.0 0.0 0.000000 0.000000
1 100003 0 2.0 25.0 64754.586000 1618864.650 64754.586000 1618864.650 -7.160000 -179.0 0.000000 0.000 25.0 0.0 25.0 0.0 0.000000 0.000000
2 100004 0 3.0 3.0 7096.155000 21288.465 7096.155000 21288.465 -7.666667 -23.0 0.000000 0.000 3.0 0.0 3.0 0.0 0.000000 0.000000
3 100006 0 5.0 16.0 62947.088438 1007153.415 62947.088438 1007153.415 -19.375000 -310.0 0.000000 0.000 16.0 0.0 16.0 0.0 0.000000 0.000000
4 100007 0 6.0 66.0 12666.444545 835985.340 12214.060227 806127.975 -3.636364 -240.0 452.384318 29857.365 50.0 16.0 60.0 6.0 0.242424 0.090909
In [149]:
app_train_numeric = app_train_inst_merged[ app_train_inst_merged.dtypes[app_train_inst_merged.dtypes == 'float64'].index]
numeric_index = app_train_numeric.isna().sum()[app_train_numeric.isna().sum()/len(app_train_numeric) <0.5].index[:5]
app_train_numeric =  app_train_numeric[numeric_index]

app_train_numeric['TARGET'] = app_train_inst_merged['TARGET']
g = app_train_numeric.groupby('TARGET')
app_train_numeric = g.apply(lambda x: x.sample(g.size().min()).reset_index(drop=True))
sns.pairplot(app_train_numeric, hue= 'TARGET')
Out[149]:
<seaborn.axisgrid.PairGrid at 0x7fb103ed6668>
In [ ]:
 
In [102]:
fig,axs = plt.subplots(1,2,figsize = (20,8))
sns.distplot(app_train_merged['INST_PAY_LATE_INSTAL_RATIO_'] , color = 'b',   bins = 20, kde = False, ax = axs[0], )

axs[0].set_title('INST_PAY_LATE_INSTAL_RATIO_'); axs[0].set_xlabel('INST_PAY_LATE_INSTAL_RATIO_'); axs[0].set_ylabel('Count');

sns.distplot(app_train_merged[app_train_merged['TARGET'] == 0]['INST_PAY_LATE_INSTAL_RATIO_'] , color = 'b',  bins = 10, hist=False, label='TARGET 0', ax = axs[1])
sns.distplot(app_train_merged[app_train_merged['TARGET'] == 1]['INST_PAY_LATE_INSTAL_RATIO_'] , color = 'r',  bins = 10, hist=False, label = 'TARGET 1',  ax = axs[1])
axs[1].set_title('INST_PAY_LATE_INSTAL_RATIO_ of applicant'); axs[1].set_xlabel('INST_PAY_LATE_INSTAL_RATIO_'); axs[1].set_ylabel('Count');
plt.legend()
Out[102]:
<matplotlib.legend.Legend at 0x7fb10ac7e898>

Credit card data

Credit card data description

Credit card balance has information about

  • Monthly balance snapshots of previous credit cards that the applicant has with Home Credit.
  • There is one row for each month of history of every previous credit in Home Credit

Some of important features in credit card data are

  • MONTHS_BALANCE - Month of balance relative to application date
  • AMT_CREDIT_LIMIT_ACTUAL -Credit card limit during the month of the previous credit
  • AMT_DRAWINGS_CURRENT - Amount drawing during the month
  • AMT_DRAWINGS_POS_CURRENT - Amount drawing or buying goods during the month
  • AMT_PAYMENT_CURRENT -How much did the client pay during the month
  • AMT_RECEIVABLE_PRINCIPAL - Amount receivable for principal on the previous credit
  • NAME_CONTRACT_STATUS - Contract status (active signed,...) on the previous credit
  • CNT_INSTALMENT_MATURE_CUM - Number of paid installments on the previous credit

Credit card data EDA

In [84]:
## Reading Application train and test file. 

file_list = ['application_test.csv','application_train.csv', 'credit_card_balance.csv']


#app_test, app_train, installment, bureau_balance, pos_cash, bureau, prev_app, cc_bal = read_files(file_list, path, print_details=False)
app_test, app_train, cc = read_files(file_list, path, print_details=False)
1it [00:00,  1.20it/s]
0it [00:00, ?it/s]
application_test.csv ... Read Completed
****************************************
1it [00:04,  4.23s/it]
0it [00:00, ?it/s]
application_train.csv ... Read Completed
****************************************
8it [00:10,  1.37s/it]
credit_card_balance.csv ... Read Completed
****************************************

In [86]:
cc.head()
Out[86]:
SK_ID_PREV SK_ID_CURR MONTHS_BALANCE AMT_BALANCE AMT_CREDIT_LIMIT_ACTUAL AMT_DRAWINGS_ATM_CURRENT AMT_DRAWINGS_CURRENT AMT_DRAWINGS_OTHER_CURRENT AMT_DRAWINGS_POS_CURRENT AMT_INST_MIN_REGULARITY AMT_PAYMENT_CURRENT AMT_PAYMENT_TOTAL_CURRENT AMT_RECEIVABLE_PRINCIPAL AMT_RECIVABLE AMT_TOTAL_RECEIVABLE CNT_DRAWINGS_ATM_CURRENT CNT_DRAWINGS_CURRENT CNT_DRAWINGS_OTHER_CURRENT CNT_DRAWINGS_POS_CURRENT CNT_INSTALMENT_MATURE_CUM NAME_CONTRACT_STATUS SK_DPD SK_DPD_DEF
0 2562384 378907 -6 56.970 135000 0.0 877.5 0.0 877.5 1700.325 1800.0 1800.0 0.000 0.000 0.000 0.0 1 0.0 1.0 35.0 Active 0 0
1 2582071 363914 -1 63975.555 45000 2250.0 2250.0 0.0 0.0 2250.000 2250.0 2250.0 60175.080 64875.555 64875.555 1.0 1 0.0 0.0 69.0 Active 0 0
2 1740877 371185 -7 31815.225 450000 0.0 0.0 0.0 0.0 2250.000 2250.0 2250.0 26926.425 31460.085 31460.085 0.0 0 0.0 0.0 30.0 Active 0 0
3 1389973 337855 -4 236572.110 225000 2250.0 2250.0 0.0 0.0 11795.760 11925.0 11925.0 224949.285 233048.970 233048.970 1.0 1 0.0 0.0 10.0 Active 0 0
4 1891521 126868 -1 453919.455 450000 0.0 11547.0 0.0 11547.0 22924.890 27000.0 27000.0 443044.395 453919.455 453919.455 0.0 1 0.0 1.0 101.0 Active 0 0
In [87]:
cc.shape
Out[87]:
(3840312, 23)
In [88]:
cc.dtypes.value_counts()
Out[88]:
float64    15
int64       7
object      1
dtype: int64
In [89]:
cc.select_dtypes('object').apply(pd.Series.nunique, axis = 0)
Out[89]:
NAME_CONTRACT_STATUS    7
dtype: int64
In [90]:
cc.select_dtypes('int').apply(pd.Series.nunique, axis = 0)
Out[90]:
SK_ID_PREV                 104307
SK_ID_CURR                 103558
MONTHS_BALANCE                 96
AMT_CREDIT_LIMIT_ACTUAL       181
CNT_DRAWINGS_CURRENT          129
SK_DPD                        917
SK_DPD_DEF                    378
dtype: int64
In [91]:
cc.describe()
Out[91]:
SK_ID_PREV SK_ID_CURR MONTHS_BALANCE AMT_BALANCE AMT_CREDIT_LIMIT_ACTUAL AMT_DRAWINGS_ATM_CURRENT AMT_DRAWINGS_CURRENT AMT_DRAWINGS_OTHER_CURRENT AMT_DRAWINGS_POS_CURRENT AMT_INST_MIN_REGULARITY AMT_PAYMENT_CURRENT AMT_PAYMENT_TOTAL_CURRENT AMT_RECEIVABLE_PRINCIPAL AMT_RECIVABLE AMT_TOTAL_RECEIVABLE CNT_DRAWINGS_ATM_CURRENT CNT_DRAWINGS_CURRENT CNT_DRAWINGS_OTHER_CURRENT CNT_DRAWINGS_POS_CURRENT CNT_INSTALMENT_MATURE_CUM SK_DPD SK_DPD_DEF
count 3.840312e+06 3.840312e+06 3.840312e+06 3.840312e+06 3.840312e+06 3.090496e+06 3.840312e+06 3.090496e+06 3.090496e+06 3.535076e+06 3.072324e+06 3.840312e+06 3.840312e+06 3.840312e+06 3.840312e+06 3.090496e+06 3.840312e+06 3.090496e+06 3.090496e+06 3.535076e+06 3.840312e+06 3.840312e+06
mean 1.904504e+06 2.783242e+05 -3.452192e+01 5.830016e+04 1.538080e+05 5.961325e+03 7.433388e+03 2.881696e+02 2.968805e+03 3.540204e+03 1.028054e+04 7.588857e+03 5.596588e+04 5.808881e+04 5.809829e+04 3.094490e-01 7.031439e-01 4.812496e-03 5.594791e-01 2.082508e+01 9.283667e+00 3.316220e-01
std 5.364695e+05 1.027045e+05 2.666775e+01 1.063070e+05 1.651457e+05 2.822569e+04 3.384608e+04 8.201989e+03 2.079689e+04 5.600154e+03 3.607808e+04 3.200599e+04 1.025336e+05 1.059654e+05 1.059718e+05 1.100401e+00 3.190347e+00 8.263861e-02 3.240649e+00 2.005149e+01 9.751570e+01 2.147923e+01
min 1.000018e+06 1.000060e+05 -9.600000e+01 -4.202502e+05 0.000000e+00 -6.827310e+03 -6.211620e+03 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 -4.233058e+05 -4.202502e+05 -4.202502e+05 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
25% 1.434385e+06 1.895170e+05 -5.500000e+01 0.000000e+00 4.500000e+04 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.523700e+02 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 4.000000e+00 0.000000e+00 0.000000e+00
50% 1.897122e+06 2.783960e+05 -2.800000e+01 0.000000e+00 1.125000e+05 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 2.702700e+03 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.500000e+01 0.000000e+00 0.000000e+00
75% 2.369328e+06 3.675800e+05 -1.100000e+01 8.904669e+04 1.800000e+05 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 6.633911e+03 9.000000e+03 6.750000e+03 8.535924e+04 8.889949e+04 8.891451e+04 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 3.200000e+01 0.000000e+00 0.000000e+00
max 2.843496e+06 4.562500e+05 -1.000000e+00 1.505902e+06 1.350000e+06 2.115000e+06 2.287098e+06 1.529847e+06 2.239274e+06 2.028820e+05 4.289207e+06 4.278316e+06 1.472317e+06 1.493338e+06 1.493338e+06 5.100000e+01 1.650000e+02 1.200000e+01 1.650000e+02 1.200000e+02 3.260000e+03 3.260000e+03
In [92]:
missingdata(cc,0)

Number of columns with over 0 percent missing values 
-------------------
9
Out[92]:
Total Percent
AMT_PAYMENT_CURRENT 767988 19.998063
AMT_DRAWINGS_OTHER_CURRENT 749816 19.524872
CNT_DRAWINGS_POS_CURRENT 749816 19.524872
CNT_DRAWINGS_OTHER_CURRENT 749816 19.524872
CNT_DRAWINGS_ATM_CURRENT 749816 19.524872
AMT_DRAWINGS_ATM_CURRENT 749816 19.524872
AMT_DRAWINGS_POS_CURRENT 749816 19.524872
CNT_INSTALMENT_MATURE_CUM 305236 7.948208
AMT_INST_MIN_REGULARITY 305236 7.948208
In [93]:
plt.figure(figsize=(15,5))
ax=sns.countplot(x='NAME_CONTRACT_STATUS', data=cc, order=cc['NAME_CONTRACT_STATUS'].value_counts(normalize=True).index);

plt.title('NAME_CONTRACT_STATUS');
plt.xticks(rotation=90);
In [94]:
corr_matrix = cc.corr()
In [95]:
corr_matrix
Out[95]:
SK_ID_PREV SK_ID_CURR MONTHS_BALANCE AMT_BALANCE AMT_CREDIT_LIMIT_ACTUAL AMT_DRAWINGS_ATM_CURRENT AMT_DRAWINGS_CURRENT AMT_DRAWINGS_OTHER_CURRENT AMT_DRAWINGS_POS_CURRENT AMT_INST_MIN_REGULARITY AMT_PAYMENT_CURRENT AMT_PAYMENT_TOTAL_CURRENT AMT_RECEIVABLE_PRINCIPAL AMT_RECIVABLE AMT_TOTAL_RECEIVABLE CNT_DRAWINGS_ATM_CURRENT CNT_DRAWINGS_CURRENT CNT_DRAWINGS_OTHER_CURRENT CNT_DRAWINGS_POS_CURRENT CNT_INSTALMENT_MATURE_CUM SK_DPD SK_DPD_DEF
SK_ID_PREV 1.000000 0.004723 0.003670 0.005046 0.006631 0.004342 0.002624 -0.000160 0.001721 0.006460 0.003472 0.001641 0.005140 0.005035 0.005032 0.002821 0.000367 -0.001412 0.000809 -0.007219 -0.001786 0.001973
SK_ID_CURR 0.004723 1.000000 0.001696 0.003510 0.005991 0.000814 0.000708 0.000958 -0.000786 0.003300 0.000127 0.000784 0.003589 0.003518 0.003524 0.002082 0.002654 -0.000131 0.002135 -0.000581 -0.000962 0.001519
MONTHS_BALANCE 0.003670 0.001696 1.000000 0.014558 0.199900 0.036802 0.065527 0.000405 0.118146 -0.087529 0.076355 0.035614 0.016266 0.013172 0.013084 0.002536 0.113321 -0.026192 0.160207 -0.008620 0.039434 0.001659
AMT_BALANCE 0.005046 0.003510 0.014558 1.000000 0.489386 0.283551 0.336965 0.065366 0.169449 0.896728 0.143934 0.151349 0.999720 0.999917 0.999897 0.309968 0.259184 0.046563 0.155553 0.005009 -0.046988 0.013009
AMT_CREDIT_LIMIT_ACTUAL 0.006631 0.005991 0.199900 0.489386 1.000000 0.247219 0.263093 0.050579 0.234976 0.467620 0.308294 0.226570 0.490445 0.488641 0.488598 0.221808 0.204237 0.030051 0.202868 -0.157269 -0.038791 -0.002236
AMT_DRAWINGS_ATM_CURRENT 0.004342 0.000814 0.036802 0.283551 0.247219 1.000000 0.800190 0.017899 0.078971 0.094824 0.189075 0.159186 0.280402 0.278290 0.278260 0.732907 0.298173 0.013254 0.076083 -0.103721 -0.022044 -0.003360
AMT_DRAWINGS_CURRENT 0.002624 0.000708 0.065527 0.336965 0.263093 0.800190 1.000000 0.236297 0.615591 0.124469 0.337343 0.305726 0.337117 0.332831 0.332796 0.594361 0.523016 0.140032 0.359001 -0.093491 -0.020606 -0.003137
AMT_DRAWINGS_OTHER_CURRENT -0.000160 0.000958 0.000405 0.065366 0.050579 0.017899 0.236297 1.000000 0.007382 0.002158 0.034577 0.025123 0.066108 0.064929 0.064923 0.012008 0.021271 0.575295 0.004458 -0.023013 -0.003693 -0.000568
AMT_DRAWINGS_POS_CURRENT 0.001721 -0.000786 0.118146 0.169449 0.234976 0.078971 0.615591 0.007382 1.000000 0.063562 0.321055 0.301760 0.173745 0.168974 0.168950 0.072658 0.520123 0.007620 0.542556 -0.106813 -0.015040 -0.002384
AMT_INST_MIN_REGULARITY 0.006460 0.003300 -0.087529 0.896728 0.467620 0.094824 0.124469 0.002158 0.063562 1.000000 0.333909 0.335201 0.896030 0.897617 0.897587 0.170616 0.148262 0.014360 0.086729 0.064320 -0.061484 -0.005715
AMT_PAYMENT_CURRENT 0.003472 0.000127 0.076355 0.143934 0.308294 0.189075 0.337343 0.034577 0.321055 0.333909 1.000000 0.994764 0.143162 0.142389 0.142371 0.142935 0.223483 0.017246 0.195074 -0.079266 -0.030222 -0.004340
AMT_PAYMENT_TOTAL_CURRENT 0.001641 0.000784 0.035614 0.151349 0.226570 0.159186 0.305726 0.025123 0.301760 0.335201 0.994764 1.000000 0.149936 0.149926 0.149914 0.125655 0.217857 0.014041 0.183973 -0.023156 -0.022475 -0.003443
AMT_RECEIVABLE_PRINCIPAL 0.005140 0.003589 0.016266 0.999720 0.490445 0.280402 0.337117 0.066108 0.173745 0.896030 0.143162 0.149936 1.000000 0.999727 0.999702 0.302627 0.258848 0.046543 0.157723 0.003664 -0.048290 0.006780
AMT_RECIVABLE 0.005035 0.003518 0.013172 0.999917 0.488641 0.278290 0.332831 0.064929 0.168974 0.897617 0.142389 0.149926 0.999727 1.000000 0.999995 0.303571 0.256347 0.046118 0.154507 0.005935 -0.046434 0.015466
AMT_TOTAL_RECEIVABLE 0.005032 0.003524 0.013084 0.999897 0.488598 0.278260 0.332796 0.064923 0.168950 0.897587 0.142371 0.149914 0.999702 0.999995 1.000000 0.303542 0.256317 0.046113 0.154481 0.005959 -0.046047 0.017243
CNT_DRAWINGS_ATM_CURRENT 0.002821 0.002082 0.002536 0.309968 0.221808 0.732907 0.594361 0.012008 0.072658 0.170616 0.142935 0.125655 0.302627 0.303571 0.303542 1.000000 0.410907 0.012730 0.108388 -0.103403 -0.029395 -0.004277
CNT_DRAWINGS_CURRENT 0.000367 0.002654 0.113321 0.259184 0.204237 0.298173 0.523016 0.021271 0.520123 0.148262 0.223483 0.217857 0.258848 0.256347 0.256317 0.410907 1.000000 0.033940 0.950546 -0.099186 -0.020786 -0.003106
CNT_DRAWINGS_OTHER_CURRENT -0.001412 -0.000131 -0.026192 0.046563 0.030051 0.013254 0.140032 0.575295 0.007620 0.014360 0.017246 0.014041 0.046543 0.046118 0.046113 0.012730 0.033940 1.000000 0.007203 -0.021632 -0.006083 -0.000895
CNT_DRAWINGS_POS_CURRENT 0.000809 0.002135 0.160207 0.155553 0.202868 0.076083 0.359001 0.004458 0.542556 0.086729 0.195074 0.183973 0.157723 0.154507 0.154481 0.108388 0.950546 0.007203 1.000000 -0.129338 -0.018212 -0.002840
CNT_INSTALMENT_MATURE_CUM -0.007219 -0.000581 -0.008620 0.005009 -0.157269 -0.103721 -0.093491 -0.023013 -0.106813 0.064320 -0.079266 -0.023156 0.003664 0.005935 0.005959 -0.103403 -0.099186 -0.021632 -0.129338 1.000000 0.059654 0.002156
SK_DPD -0.001786 -0.000962 0.039434 -0.046988 -0.038791 -0.022044 -0.020606 -0.003693 -0.015040 -0.061484 -0.030222 -0.022475 -0.048290 -0.046434 -0.046047 -0.029395 -0.020786 -0.006083 -0.018212 0.059654 1.000000 0.218950
SK_DPD_DEF 0.001973 0.001519 0.001659 0.013009 -0.002236 -0.003360 -0.003137 -0.000568 -0.002384 -0.005715 -0.004340 -0.003443 0.006780 0.015466 0.017243 -0.004277 -0.003106 -0.000895 -0.002840 0.002156 0.218950 1.000000
In [96]:
plt.subplots(figsize = (15,15))
sns.heatmap(cc.corr(), cmap = 'viridis')
Out[96]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fd14ce2c1d0>
In [97]:
df_cc_corr = cc[[ 'MONTHS_BALANCE',
 'AMT_BALANCE',
 'AMT_CREDIT_LIMIT_ACTUAL',
 'AMT_DRAWINGS_CURRENT',
 'AMT_PAYMENT_CURRENT',
 'AMT_PAYMENT_TOTAL_CURRENT',
 'AMT_RECEIVABLE_PRINCIPAL',
 'AMT_RECIVABLE',
 'AMT_TOTAL_RECEIVABLE'
]].copy()

# Calculate correlations
corr = df_cc_corr.corr().abs()
# Heatmap
plt.figure(figsize=(15,8))
sns.heatmap(corr, annot=True, linewidths=.2, cmap="icefire");

Credit card feature engineering

Few feature categories which we can evaluate on credit card balance

  • Number of loans and installments
  • Past due or missed payments
  • Credit card usages type
  • Credit limit and growth
  • Avg payment compare to minimum payment required

No. Of LOANS

In [ ]:
C = cc_bal.groupby(by = ['SK_ID_CURR'])['SK_ID_PREV'].nunique().reset_index().rename(index = str, columns = {'SK_ID_PREV': 'LOAN_COUNTS'})
display(C['LOAN_COUNTS'].value_counts())

C.head()
1    102818
2       732
3         7
4         1
Name: LOAN_COUNTS, dtype: int64
Out[ ]:
SK_ID_CURR LOAN_COUNTS
0 100006 1
1 100011 1
2 100013 1
3 100021 1
4 100023 1

No. of installments

In [ ]:
grp = cc_bal.groupby(by = ['SK_ID_CURR', 'SK_ID_PREV'])['CNT_INSTALMENT_MATURE_CUM'].max().reset_index().rename(index = str, columns = {'CNT_INSTALMENT_MATURE_CUM': 'NO_INSTALMENTS'})
grp1 = grp.groupby(by = ['SK_ID_CURR'])['NO_INSTALMENTS'].sum().reset_index().rename(index = str, columns = {'NO_INSTALMENTS': 'TOTAL_INSTALMENTS'})

grp1.head(10)

C = C.merge(grp1, on = 'SK_ID_CURR', how = 'left')
C.head()
Out[ ]:
SK_ID_CURR LOAN_COUNTS TOTAL_INSTALMENTS
0 100006 1 0.0
1 100011 1 33.0
2 100013 1 22.0
3 100021 1 0.0
4 100023 1 0.0
In [ ]:
C.groupby('SK_ID_CURR')['TOTAL_INSTALMENTS'].agg(['mean', 'sum', 'max'])
Out[ ]:
mean sum max
SK_ID_CURR
100006 0.0 0.0 0.0
100011 33.0 33.0 33.0
100013 22.0 22.0 22.0
100021 0.0 0.0 0.0
100023 0.0 0.0 0.0
... ... ... ...
456244 17.0 17.0 17.0
456246 7.0 7.0 7.0
456247 32.0 32.0 32.0
456248 0.0 0.0 0.0
456250 10.0 10.0 10.0

103558 rows × 3 columns

In [ ]:
grp = cc_bal.groupby(by = ['SK_ID_CURR'])['AMT_DRAWINGS_ATM_CURRENT'].sum().reset_index()
grp.head()

C = C.merge(grp, on = 'SK_ID_CURR', how = 'left')
C.head()
Out[ ]:
SK_ID_CURR LOAN_COUNTS TOTAL_INSTALMENTS balance_to_limit_ratio SK_DPD PERCENTAGE_MISSED_PAYMENTS AMT_DRAWINGS_ATM_CURRENT
0 100006 1 0.0 0.00000 0 0.000000 0.0
1 100011 1 33.0 1.05000 0 0.000000 180000.0
2 100013 1 22.0 1.02489 1 1.041667 571500.0
3 100021 1 0.0 0.00000 0 0.000000 0.0
4 100023 1 0.0 0.00000 0 0.000000 0.0
In [ ]:
grp = cc_bal.groupby(by = ['SK_ID_CURR'])['AMT_DRAWINGS_CURRENT'].sum().reset_index()
grp.head()

C = C.merge(grp, on = 'SK_ID_CURR', how = 'left')
C.head()
Out[ ]:
SK_ID_CURR LOAN_COUNTS TOTAL_INSTALMENTS balance_to_limit_ratio SK_DPD PERCENTAGE_MISSED_PAYMENTS AMT_DRAWINGS_ATM_CURRENT AMT_DRAWINGS_CURRENT
0 100006 1 0.0 0.00000 0 0.000000 0.0 0.0
1 100011 1 33.0 1.05000 0 0.000000 180000.0 180000.0
2 100013 1 22.0 1.02489 1 1.041667 571500.0 571500.0
3 100021 1 0.0 0.00000 0 0.000000 0.0 0.0
4 100023 1 0.0 0.00000 0 0.000000 0.0 0.0
In [ ]:
grp = cc_bal.groupby(by = ['SK_ID_CURR'])['CNT_DRAWINGS_CURRENT'].sum().reset_index()
grp.head()

C = C.merge(grp, on = 'SK_ID_CURR', how = 'left')
C.head()
Out[ ]:
SK_ID_CURR LOAN_COUNTS TOTAL_INSTALMENTS balance_to_limit_ratio SK_DPD PERCENTAGE_MISSED_PAYMENTS AMT_DRAWINGS_ATM_CURRENT AMT_DRAWINGS_CURRENT CNT_DRAWINGS_CURRENT
0 100006 1 0.0 0.00000 0 0.000000 0.0 0.0 0
1 100011 1 33.0 1.05000 0 0.000000 180000.0 180000.0 4
2 100013 1 22.0 1.02489 1 1.041667 571500.0 571500.0 23
3 100021 1 0.0 0.00000 0 0.000000 0.0 0.0 0
4 100023 1 0.0 0.00000 0 0.000000 0.0 0.0 0

Max Credit Limit

In [ ]:
max_cr_limit = cc.groupby('sk_id_curr')[['amt_credit_limit_actual']].max().reset_index()
max_cr_limit.columns = ['sk_id_curr','max_cr_lim']
max_cr_limit.head()
Out[ ]:
sk_id_curr max_cr_lim
0 100006 270000
1 100011 180000
2 100013 157500
3 100021 675000
4 100023 225000

Credit Limit Growth

In [ ]:
cr_limit_growth = ((cc.groupby('sk_id_curr')['amt_credit_limit_actual'].max() - cc.groupby('sk_id_curr')['amt_credit_limit_actual'].min())
 / cc.groupby('sk_id_curr')['amt_credit_limit_actual'].min()).replace(np.inf, np.nan)

cr_limit_growth.head()
Out[ ]:
sk_id_curr
100006    0.0
100011    1.0
100013    2.5
100021    0.0
100023    4.0
Name: amt_credit_limit_actual, dtype: float64
In [ ]:
num_months = cc.groupby('sk_id_curr')['months_balance'].count()

num_months.head()
Out[ ]:
sk_id_curr
100006     6
100011    74
100013    96
100021    17
100023     8
Name: months_balance, dtype: int64
In [ ]:
cr_lim_growth_norm = (cr_limit_growth / num_months).reset_index()
cr_lim_growth_norm.columns = ['sk_id_curr', 'cr_lim_growth_norm']
In [ ]:
cr_lim_growth_norm.head()
Out[ ]:
sk_id_curr cr_lim_growth_norm
0 100006 0.000000
1 100011 0.013514
2 100013 0.026042
3 100021 0.000000
4 100023 0.500000

Avg Paid relative to minimum avg installment

In [ ]:
grp = cc.groupby(['SK_ID_CURR', 'SK_ID_PREV'])['AMT_PAYMENT_TOTAL_CURRENT',
                                                       'AMT_INST_MIN_REGULARITY'].mean().reset_index().groupby(
                                                           'SK_ID_CURR')['AMT_PAYMENT_TOTAL_CURRENT', 'AMT_INST_MIN_REGULARITY'].mean().reset_index()

grp['CC_MIN_PAYMENT_RATIO'] = avg_payment['AMT_PAYMENT_TOTAL_CURRENT'] / avg_payment['AMT_INST_MIN_REGULARITY']
In [ ]:
 
In [37]:
grp.head()
Out[37]:
SK_ID_CURR AMT_PAYMENT_TOTAL_CURRENT AMT_INST_MIN_REGULARITY
0 100006 0.000000 0.000000
1 100011 4520.067568 3956.221849
2 100013 6817.172344 1454.539551
3 100021 0.000000 0.000000
4 100023 0.000000 0.000000
In [ ]:
avg_payment_norm = (cc.groupby('sk_id_curr')['amt_payment_total_current'].mean()/ cc.groupby('sk_id_curr')['amt_inst_min_regularity'].mean()).reset_index()
In [ ]:
avg_payment_norm.columns = ['sk_id_curr', 'avg_payment_norm']
In [ ]:
avg_payment_norm
Out[ ]:
sk_id_curr avg_payment_norm
0 100006 NaN
1 100011 1.142521
2 100013 4.686825
3 100021 NaN
4 100023 NaN
... ... ...
103553 456244 5.022957
103554 456246 10.808000
103555 456247 2.909355
103556 456248 NaN
103557 456250 0.215226

103558 rows × 2 columns

Payment above minimum

In [ ]:
cc['paymentdiff'] = cc.amt_payment_total_current - cc.amt_inst_min_regularity
In [ ]:
pay_diff = cc[['sk_id_curr','paymentdiff']]
In [ ]:
avg_pay_diff = pay_diff.groupby('sk_id_curr')['paymentdiff'].mean().reset_index()
avg_pay_diff.columns = ['sk_id_curr', 'avg_pay_diff']
In [ ]:
avg_pay_diff.head()
Out[ ]:
sk_id_curr avg_pay_diff
0 100006 0.000000
1 100011 625.764452
2 100013 5898.814888
3 100021 0.000000
4 100023 0.000000
In [ ]:
med_pay_diff = pay_diff.groupby('sk_id_curr')['paymentdiff'].median().reset_index()
med_pay_diff.columns = ['sk_id_curr', 'med_pay_diff']
In [ ]:
med_pay_diff.head()
Out[ ]:
sk_id_curr med_pay_diff
0 100006 0.0
1 100011 0.0
2 100013 0.0
3 100021 0.0
4 100023 0.0

Combine columns

In [ ]:
feng_df = max_cr_limit.merge(cr_lim_growth_norm, how = 'left', on = 'sk_id_curr')
feng_df = feng_df.merge(avg_pay_diff, how = 'left', on = 'sk_id_curr')
feng_df = feng_df.merge(med_pay_diff, how = 'left', on = 'sk_id_curr')
In [ ]:
feng_df.head()
Out[ ]:
sk_id_curr max_cr_lim cr_lim_growth_norm avg_pay_diff med_pay_diff
0 100006 270000 0.000000 0.000000 0.0
1 100011 180000 0.013514 625.764452 0.0
2 100013 157500 0.026042 5898.814888 0.0
3 100021 675000 0.000000 0.000000 0.0
4 100023 225000 0.500000 0.000000 0.0
In [ ]:
(feng_df.sk_id_curr.value_counts() > 1).sum()
Out[ ]:
0

Contract Status

In [ ]:
contract = cc_bal[['SK_ID_CURR', 'NAME_CONTRACT_STATUS']]

contract = pd.get_dummies(contract, columns= ['NAME_CONTRACT_STATUS'] )
contract.head()
Out[ ]:
SK_ID_CURR NAME_CONTRACT_STATUS_Active NAME_CONTRACT_STATUS_Approved NAME_CONTRACT_STATUS_Completed NAME_CONTRACT_STATUS_Demand NAME_CONTRACT_STATUS_Refused NAME_CONTRACT_STATUS_Sent proposal NAME_CONTRACT_STATUS_Signed
0 378907 1 0 0 0 0 0 0
1 363914 1 0 0 0 0 0 0
2 371185 1 0 0 0 0 0 0
3 337855 1 0 0 0 0 0 0
4 126868 1 0 0 0 0 0 0
In [ ]:
drops = ['NAME_CONTRACT_STATUS_Approved', 'NAME_CONTRACT_STATUS_Demand',
           'NAME_CONTRACT_STATUS_Refused', 'NAME_CONTRACT_STATUS_Sent proposal',
           'NAME_CONTRACT_STATUS_Signed' ]

cc = cc.drop(drops, axis =1)

Credit Load

In [ ]:
grp = cc_bal.groupby(by = ['SK_ID_CURR', 'SK_ID_PREV'])['AMT_CREDIT_LIMIT_ACTUAL', 'AMT_BALANCE'].max().reset_index()
grp['balance_to_limit_ratio'] = grp['AMT_BALANCE'] / grp['AMT_CREDIT_LIMIT_ACTUAL']
display(grp.head(10))

grp2 = grp.groupby('SK_ID_CURR')['balance_to_limit_ratio'].mean().reset_index()
display(grp2.head(10))

C = C.merge(grp2, on = 'SK_ID_CURR', how = 'left')
C.head()
/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:1: FutureWarning: Indexing with multiple keys (implicitly converted to a tuple of keys) will be deprecated, use a list instead.
  """Entry point for launching an IPython kernel.
SK_ID_CURR SK_ID_PREV AMT_CREDIT_LIMIT_ACTUAL AMT_BALANCE balance_to_limit_ratio
0 100006 1489396 270000 0.000 0.000000
1 100011 1843384 180000 189000.000 1.050000
2 100013 2038692 157500 161420.220 1.024890
3 100021 2594025 675000 0.000 0.000000
4 100023 1499902 225000 0.000 0.000000
5 100028 1914954 225000 37335.915 0.165937
6 100036 2621538 90000 0.000 0.000000
7 100042 2137382 90000 93118.455 1.034649
8 100043 1557583 427500 435861.585 1.019559
9 100047 1472630 450000 0.000 0.000000
SK_ID_CURR balance_to_limit_ratio
0 100006 0.000000
1 100011 1.050000
2 100013 1.024890
3 100021 0.000000
4 100023 0.000000
5 100028 0.165937
6 100036 0.000000
7 100042 1.034649
8 100043 1.019559
9 100047 0.000000
Out[ ]:
SK_ID_CURR LOAN_COUNTS TOTAL_INSTALMENTS balance_to_limit_ratio
0 100006 1 0.0 0.00000
1 100011 1 33.0 1.05000
2 100013 1 22.0 1.02489
3 100021 1 0.0 0.00000
4 100023 1 0.0 0.00000

Past Due

In [ ]:
grp = cc_bal.groupby('SK_ID_CURR')['SK_DPD'].sum().reset_index()
display(grp.head())

C = C.merge(grp, on = 'SK_ID_CURR', how = 'left')
C.head()
SK_ID_CURR SK_DPD
0 100006 0
1 100011 0
2 100013 1
3 100021 0
4 100023 0
Out[ ]:
SK_ID_CURR LOAN_COUNTS TOTAL_INSTALMENTS balance_to_limit_ratio SK_DPD
0 100006 1 0.0 0.00000 0
1 100011 1 33.0 1.05000 0
2 100013 1 22.0 1.02489 1
3 100021 1 0.0 0.00000 0
4 100023 1 0.0 0.00000 0
In [ ]:
## No. Of LOANS

C = cc_bal.groupby(by = ['SK_ID_CURR'])['SK_ID_PREV'].nunique().reset_index().rename(index = str, columns = {'SK_ID_PREV': 'LOAN_COUNTS'})
display(C['LOAN_COUNTS'].value_counts())

C.head()

## No. of installments

grp = cc_bal.groupby(by = ['SK_ID_CURR', 'SK_ID_PREV'])['CNT_INSTALMENT_MATURE_CUM'].max().reset_index().rename(index = str, columns = {'CNT_INSTALMENT_MATURE_CUM': 'NO_INSTALMENTS'})
grp1 = grp.groupby(by = ['SK_ID_CURR'])['NO_INSTALMENTS'].sum().reset_index().rename(index = str, columns = {'NO_INSTALMENTS': 'TOTAL_INSTALMENTS'})

grp1.head(10)

C = C.merge(grp1, on = 'SK_ID_CURR', how = 'left')
C.head()

C.groupby('SK_ID_CURR')['TOTAL_INSTALMENTS'].agg(['mean', 'sum', 'max'])



## Credit Load

grp = cc_bal.groupby(by = ['SK_ID_CURR', 'SK_ID_PREV'])['AMT_CREDIT_LIMIT_ACTUAL', 'AMT_BALANCE'].max().reset_index()
grp['balance_to_limit_ratio'] = grp['AMT_BALANCE'] / grp['AMT_CREDIT_LIMIT_ACTUAL']
display(grp.head(10))

grp2 = grp.groupby('SK_ID_CURR')['balance_to_limit_ratio'].mean().reset_index()
display(grp2.head(10))

C = C.merge(grp2, on = 'SK_ID_CURR', how = 'left')
C.head()

### Past Due

grp = cc_bal.groupby('SK_ID_CURR')['SK_DPD'].sum().reset_index()
display(grp.head())

C = C.merge(grp, on = 'SK_ID_CURR', how = 'left')
C.head()

Generating Credit card features

In [98]:
def f(min_pay, total_pay):
    
    M = min_pay.tolist()
    T = total_pay.tolist()
    P = len(M)
    c = 0 
    # Find the count of transactions when Payment made is less than Minimum Payment 
    for i in range(len(M)):
        if T[i] < M[i]:
            c += 1  
    return 100*c/P
In [99]:
def credit_data_generator():
    cc = read_files(['credit_card_balance.csv'], path, print_details=False)[0]
    

    cc['NAME_CONTRACT_STATUS'] = cc['NAME_CONTRACT_STATUS'].apply(
        lambda x : 1 if x == 'Active' else 0)
    
    num_aggregations = {
            'SK_ID_PREV' : ['nunique'],
            'MONTHS_BALANCE':["sum","mean"], 
            'AMT_BALANCE':["sum","mean","min","max"],
            'AMT_CREDIT_LIMIT_ACTUAL':["sum","mean"], 

            'AMT_DRAWINGS_ATM_CURRENT':["sum","mean","min","max"],
            'AMT_DRAWINGS_CURRENT':["sum","mean","min","max"], 
            'AMT_DRAWINGS_OTHER_CURRENT':["sum","mean","min","max"],
            'AMT_DRAWINGS_POS_CURRENT':["sum","mean","min","max"], 
            'AMT_INST_MIN_REGULARITY':["sum","mean","min","max"],
            'AMT_PAYMENT_CURRENT':["sum","mean","min","max"], 
            'AMT_PAYMENT_TOTAL_CURRENT':["sum","mean","min","max"],
            'AMT_RECEIVABLE_PRINCIPAL':["sum","mean","min","max"], 
            'AMT_RECIVABLE':["sum","mean","min","max"], 
            'AMT_TOTAL_RECEIVABLE':["sum","mean","min","max"],

            'CNT_DRAWINGS_ATM_CURRENT':["sum","mean"], 
            'CNT_DRAWINGS_CURRENT':["sum","mean","max"],
            'CNT_DRAWINGS_OTHER_CURRENT':["mean","max"], 
            'CNT_DRAWINGS_POS_CURRENT':["sum","mean","max"],
            'CNT_INSTALMENT_MATURE_CUM':["sum","mean","max","min"],    
            'SK_DPD':["sum","mean","max"], 
            'SK_DPD_DEF':["sum","mean","max"],

            'NAME_CONTRACT_STATUS':["sum","mean","min","max"], 
            # 'TOTAL_INSTALMENTS':["mean"],
    }

    cc_agg = cc.groupby('SK_ID_CURR').agg({**num_aggregations})

    cc_agg.columns = ['CC_' + e[0] + '_' + e[1].upper() for e in cc_agg.columns.to_list()]

    cc_agg.reset_index(inplace=True)

    grp = cc.groupby(by = ['SK_ID_CURR', 'SK_ID_PREV'])['CNT_INSTALMENT_MATURE_CUM'].max().reset_index().rename(index = str, columns = {'CNT_INSTALMENT_MATURE_CUM': 'NO_INSTALMENTS'})
    grp1 = grp.groupby(by = ['SK_ID_CURR'])['NO_INSTALMENTS'].sum().reset_index().rename(index = str, columns = {'NO_INSTALMENTS': 'CC_TOTAL_INSTALMENTS'})
    cc_agg = cc_agg.merge(grp1, on = 'SK_ID_CURR', how ='left')

    grp = cc.groupby(by = ['SK_ID_CURR', 'SK_ID_PREV'])['AMT_CREDIT_LIMIT_ACTUAL', 'AMT_BALANCE'].max().reset_index()
    grp['balance_to_limit_ratio'] = grp['AMT_BALANCE'] / grp['AMT_CREDIT_LIMIT_ACTUAL']
    grp2 = grp.groupby('SK_ID_CURR')['balance_to_limit_ratio'].mean().reset_index().rename(index = str, columns = {'balance_to_limit_ratio': 'CC_CREDIT_LOAD'})
    cc_agg = cc_agg.merge(grp2, on = 'SK_ID_CURR', how ='left')

    grp = cc.groupby(by = ['SK_ID_CURR']).apply(lambda x: f(x.AMT_INST_MIN_REGULARITY, x.AMT_PAYMENT_CURRENT)).reset_index().rename(
        index = str, columns = { 0 : 'CC_PERCENTAGE_MISSED_PAYMENTS'})
    cc_agg = cc_agg.merge(grp, on = 'SK_ID_CURR', how ='left')

    grp = cc.groupby(['SK_ID_CURR', 'SK_ID_PREV'])['AMT_PAYMENT_TOTAL_CURRENT',
                                                       'AMT_INST_MIN_REGULARITY'].mean().reset_index().groupby(
                                                           'SK_ID_CURR')['AMT_PAYMENT_TOTAL_CURRENT', 'AMT_INST_MIN_REGULARITY'].mean().reset_index()
    grp['CC_MIN_PAYMENT_RATIO'] = grp['AMT_PAYMENT_TOTAL_CURRENT'] / grp['AMT_INST_MIN_REGULARITY']
    cc_agg = cc_agg.merge(grp, on = 'SK_ID_CURR', how ='left')

    return cc_agg
In [100]:
cc_grouped = credit_data_generator()
8it [00:08,  1.06s/it]
credit_card_balance.csv ... Read Completed
****************************************
/usr/local/lib/python3.6/site-packages/ipykernel_launcher.py:47: FutureWarning: Indexing with multiple keys (implicitly converted to a tuple of keys) will be deprecated, use a list instead.
/usr/local/lib/python3.6/site-packages/ipykernel_launcher.py:57: FutureWarning: Indexing with multiple keys (implicitly converted to a tuple of keys) will be deprecated, use a list instead.
/usr/local/lib/python3.6/site-packages/ipykernel_launcher.py:58: FutureWarning: Indexing with multiple keys (implicitly converted to a tuple of keys) will be deprecated, use a list instead.
In [103]:
#cc_grouped.to_csv('/content/drive/My Drive/I526 Final Project/data/processed/cc_processed.csv', index=False)
cc_grouped.to_csv("../Project/data/processed/cc_processed.csv", index = False)
In [104]:
app_train_i=app_train[['SK_ID_CURR', 'TARGET']]
In [124]:
app_train_cc_merged = app_train_i.merge(cc_grouped.reset_index(), 
                                            left_on='SK_ID_CURR', right_on='SK_ID_CURR', 
                                             validate='one_to_one')
In [125]:
corr_matrix = app_train_cc_merged.corr()
In [126]:
corr_matrix["TARGET"].sort_values(ascending=False)
Out[126]:
TARGET                                1.000000
CC_CNT_DRAWINGS_ATM_CURRENT_MEAN      0.107692
CC_CNT_DRAWINGS_CURRENT_MAX           0.101389
CC_CREDIT_LOAD                        0.095072
CC_AMT_BALANCE_MEAN                   0.087177
CC_AMT_TOTAL_RECEIVABLE_MEAN          0.086490
CC_AMT_RECIVABLE_MEAN                 0.086478
CC_AMT_RECEIVABLE_PRINCIPAL_MEAN      0.086062
CC_CNT_DRAWINGS_CURRENT_MEAN          0.082520
AMT_INST_MIN_REGULARITY               0.073921
CC_AMT_INST_MIN_REGULARITY_MEAN       0.073724
CC_PERCENTAGE_MISSED_PAYMENTS         0.070668
CC_CNT_DRAWINGS_POS_CURRENT_MAX       0.068942
CC_AMT_BALANCE_MAX                    0.068798
CC_AMT_TOTAL_RECEIVABLE_MAX           0.068081
CC_AMT_RECIVABLE_MAX                  0.068066
CC_AMT_RECEIVABLE_PRINCIPAL_MAX       0.066919
CC_AMT_BALANCE_MIN                    0.064163
CC_AMT_INST_MIN_REGULARITY_MAX        0.063888
CC_AMT_RECIVABLE_MIN                  0.063610
CC_AMT_TOTAL_RECEIVABLE_MIN           0.063608
CC_AMT_RECEIVABLE_PRINCIPAL_MIN       0.063236
CC_MONTHS_BALANCE_MEAN                0.062081
CC_AMT_DRAWINGS_ATM_CURRENT_MEAN      0.059925
CC_MONTHS_BALANCE_SUM                 0.059051
CC_AMT_DRAWINGS_CURRENT_MEAN          0.058732
CC_CNT_DRAWINGS_POS_CURRENT_MEAN      0.052547
CC_AMT_DRAWINGS_CURRENT_MAX           0.052318
CC_CNT_DRAWINGS_CURRENT_SUM           0.050685
CC_CNT_DRAWINGS_ATM_CURRENT_SUM       0.049970
CC_CNT_DRAWINGS_POS_CURRENT_SUM       0.038457
CC_AMT_DRAWINGS_ATM_CURRENT_SUM       0.038106
CC_AMT_PAYMENT_TOTAL_CURRENT_MAX      0.028138
CC_NAME_CONTRACT_STATUS_MIN           0.026653
CC_AMT_DRAWINGS_ATM_CURRENT_MAX       0.024087
CC_AMT_DRAWINGS_CURRENT_SUM           0.023680
CC_AMT_PAYMENT_TOTAL_CURRENT_MEAN     0.022665
AMT_PAYMENT_TOTAL_CURRENT             0.022597
CC_NAME_CONTRACT_STATUS_MEAN          0.019900
CC_AMT_BALANCE_SUM                    0.018663
CC_AMT_RECEIVABLE_PRINCIPAL_SUM       0.018459
CC_AMT_TOTAL_RECEIVABLE_SUM           0.018352
CC_AMT_RECIVABLE_SUM                  0.018319
CC_CNT_DRAWINGS_OTHER_CURRENT_MEAN    0.014756
CC_AMT_DRAWINGS_ATM_CURRENT_MIN       0.012955
CC_AMT_PAYMENT_CURRENT_MIN            0.012466
CC_AMT_DRAWINGS_CURRENT_MIN           0.010938
CC_AMT_DRAWINGS_OTHER_CURRENT_MEAN    0.009798
CC_AMT_DRAWINGS_OTHER_CURRENT_SUM     0.008105
CC_SK_DPD_DEF_MAX                     0.007089
CC_SK_DPD_DEF_SUM                     0.006972
CC_SK_DPD_DEF_MEAN                    0.006660
CC_AMT_PAYMENT_CURRENT_MEAN           0.005274
CC_SK_ID_PREV_NUNIQUE                 0.004388
CC_AMT_DRAWINGS_OTHER_CURRENT_MAX     0.004003
CC_AMT_INST_MIN_REGULARITY_MIN        0.003177
CC_AMT_INST_MIN_REGULARITY_SUM        0.002897
CC_AMT_PAYMENT_TOTAL_CURRENT_MIN      0.002496
CC_NAME_CONTRACT_STATUS_MAX           0.001191
CC_AMT_PAYMENT_CURRENT_MAX            0.000438
CC_CNT_DRAWINGS_OTHER_CURRENT_MAX     0.000008
CC_AMT_DRAWINGS_OTHER_CURRENT_MIN    -0.002700
CC_SK_DPD_MEAN                       -0.003195
CC_SK_DPD_SUM                        -0.003566
index                                -0.003689
SK_ID_CURR                           -0.003729
CC_AMT_DRAWINGS_POS_CURRENT_SUM      -0.003784
CC_AMT_DRAWINGS_POS_CURRENT_MEAN     -0.004017
CC_AMT_PAYMENT_TOTAL_CURRENT_SUM     -0.004898
CC_AMT_DRAWINGS_POS_CURRENT_MIN      -0.004955
CC_AMT_PAYMENT_CURRENT_SUM           -0.005132
CC_MIN_PAYMENT_RATIO                 -0.005831
CC_SK_DPD_MAX                        -0.005975
CC_AMT_CREDIT_LIMIT_ACTUAL_MEAN      -0.008852
CC_AMT_DRAWINGS_POS_CURRENT_MAX      -0.009217
CC_TOTAL_INSTALMENTS                 -0.017407
CC_CNT_INSTALMENT_MATURE_CUM_MAX     -0.017568
CC_CNT_INSTALMENT_MATURE_CUM_MEAN    -0.028917
CC_CNT_INSTALMENT_MATURE_CUM_MIN     -0.031873
CC_CNT_INSTALMENT_MATURE_CUM_SUM     -0.042363
CC_AMT_CREDIT_LIMIT_ACTUAL_SUM       -0.045460
CC_NAME_CONTRACT_STATUS_SUM          -0.059376
Name: TARGET, dtype: float64
In [127]:
app_train_cc_merged.head()
Out[127]:
SK_ID_CURR TARGET index CC_SK_ID_PREV_NUNIQUE CC_MONTHS_BALANCE_SUM CC_MONTHS_BALANCE_MEAN CC_AMT_BALANCE_SUM CC_AMT_BALANCE_MEAN CC_AMT_BALANCE_MIN CC_AMT_BALANCE_MAX CC_AMT_CREDIT_LIMIT_ACTUAL_SUM CC_AMT_CREDIT_LIMIT_ACTUAL_MEAN CC_AMT_DRAWINGS_ATM_CURRENT_SUM CC_AMT_DRAWINGS_ATM_CURRENT_MEAN CC_AMT_DRAWINGS_ATM_CURRENT_MIN CC_AMT_DRAWINGS_ATM_CURRENT_MAX CC_AMT_DRAWINGS_CURRENT_SUM CC_AMT_DRAWINGS_CURRENT_MEAN CC_AMT_DRAWINGS_CURRENT_MIN CC_AMT_DRAWINGS_CURRENT_MAX CC_AMT_DRAWINGS_OTHER_CURRENT_SUM CC_AMT_DRAWINGS_OTHER_CURRENT_MEAN CC_AMT_DRAWINGS_OTHER_CURRENT_MIN CC_AMT_DRAWINGS_OTHER_CURRENT_MAX CC_AMT_DRAWINGS_POS_CURRENT_SUM CC_AMT_DRAWINGS_POS_CURRENT_MEAN CC_AMT_DRAWINGS_POS_CURRENT_MIN CC_AMT_DRAWINGS_POS_CURRENT_MAX CC_AMT_INST_MIN_REGULARITY_SUM CC_AMT_INST_MIN_REGULARITY_MEAN CC_AMT_INST_MIN_REGULARITY_MIN CC_AMT_INST_MIN_REGULARITY_MAX CC_AMT_PAYMENT_CURRENT_SUM CC_AMT_PAYMENT_CURRENT_MEAN CC_AMT_PAYMENT_CURRENT_MIN CC_AMT_PAYMENT_CURRENT_MAX CC_AMT_PAYMENT_TOTAL_CURRENT_SUM CC_AMT_PAYMENT_TOTAL_CURRENT_MEAN CC_AMT_PAYMENT_TOTAL_CURRENT_MIN CC_AMT_PAYMENT_TOTAL_CURRENT_MAX CC_AMT_RECEIVABLE_PRINCIPAL_SUM CC_AMT_RECEIVABLE_PRINCIPAL_MEAN CC_AMT_RECEIVABLE_PRINCIPAL_MIN CC_AMT_RECEIVABLE_PRINCIPAL_MAX CC_AMT_RECIVABLE_SUM CC_AMT_RECIVABLE_MEAN CC_AMT_RECIVABLE_MIN CC_AMT_RECIVABLE_MAX CC_AMT_TOTAL_RECEIVABLE_SUM CC_AMT_TOTAL_RECEIVABLE_MEAN CC_AMT_TOTAL_RECEIVABLE_MIN CC_AMT_TOTAL_RECEIVABLE_MAX CC_CNT_DRAWINGS_ATM_CURRENT_SUM CC_CNT_DRAWINGS_ATM_CURRENT_MEAN CC_CNT_DRAWINGS_CURRENT_SUM CC_CNT_DRAWINGS_CURRENT_MEAN CC_CNT_DRAWINGS_CURRENT_MAX CC_CNT_DRAWINGS_OTHER_CURRENT_MEAN CC_CNT_DRAWINGS_OTHER_CURRENT_MAX CC_CNT_DRAWINGS_POS_CURRENT_SUM CC_CNT_DRAWINGS_POS_CURRENT_MEAN CC_CNT_DRAWINGS_POS_CURRENT_MAX CC_CNT_INSTALMENT_MATURE_CUM_SUM CC_CNT_INSTALMENT_MATURE_CUM_MEAN CC_CNT_INSTALMENT_MATURE_CUM_MAX CC_CNT_INSTALMENT_MATURE_CUM_MIN CC_SK_DPD_SUM CC_SK_DPD_MEAN CC_SK_DPD_MAX CC_SK_DPD_DEF_SUM CC_SK_DPD_DEF_MEAN CC_SK_DPD_DEF_MAX CC_NAME_CONTRACT_STATUS_SUM CC_NAME_CONTRACT_STATUS_MEAN CC_NAME_CONTRACT_STATUS_MIN CC_NAME_CONTRACT_STATUS_MAX CC_TOTAL_INSTALMENTS CC_CREDIT_LOAD CC_PERCENTAGE_MISSED_PAYMENTS AMT_PAYMENT_TOTAL_CURRENT AMT_INST_MIN_REGULARITY CC_MIN_PAYMENT_RATIO
0 100006 0 0 1 -21 -3.5 0.000 0.000000 0.0 0.0 1620000 270000.000000 0.0 NaN NaN NaN 0.0 0.000000 0.0 0.0 0.0 NaN NaN NaN 0.0 NaN NaN NaN 0.000 0.000000 0.0 0.0 0.00 NaN NaN NaN 0.0 0.000000 0.0 0.0 0.00 0.000000 0.0 0.0 0.000 0.000000 0.000 0.0 0.000 0.000000 0.000 0.0 0.0 NaN 0 0.000000 0 NaN NaN 0.0 NaN NaN 0.0 0.000000 0.0 0.0 0 0.0 0 0 0.0 0 6 1.000000 1 1 0.0 0.00 0.0 0.000000 0.000000 NaN
1 100011 0 1 1 -2849 -38.5 4031676.225 54482.111149 0.0 189000.0 12150000 164189.189189 180000.0 2432.432432 0.0 180000.0 180000.0 2432.432432 0.0 180000.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 288804.195 3956.221849 0.0 9000.0 358386.75 4843.064189 0.0 55485.0 334485.0 4520.067568 0.0 55485.0 3877754.58 52402.088919 0.0 180000.0 4028055.255 54433.179122 -563.355 189000.0 4028055.255 54433.179122 -563.355 189000.0 4.0 0.054054 4 0.054054 4 0.0 0.0 0.0 0.0 0.0 1881.0 25.767123 33.0 1.0 0 0.0 0 0 0.0 0 74 1.000000 1 1 33.0 1.05 0.0 4520.067568 3956.221849 1.142521
2 100021 0 3 1 -170 -10.0 0.000 0.000000 0.0 0.0 11475000 675000.000000 0.0 NaN NaN NaN 0.0 0.000000 0.0 0.0 0.0 NaN NaN NaN 0.0 NaN NaN NaN 0.000 0.000000 0.0 0.0 0.00 NaN NaN NaN 0.0 0.000000 0.0 0.0 0.00 0.000000 0.0 0.0 0.000 0.000000 0.000 0.0 0.000 0.000000 0.000 0.0 0.0 NaN 0 0.000000 0 NaN NaN 0.0 NaN NaN 0.0 0.000000 0.0 0.0 0 0.0 0 0 0.0 0 7 0.411765 0 1 0.0 0.00 0.0 0.000000 0.000000 NaN
3 100023 0 4 1 -60 -7.5 0.000 0.000000 0.0 0.0 1080000 135000.000000 0.0 NaN NaN NaN 0.0 0.000000 0.0 0.0 0.0 NaN NaN NaN 0.0 NaN NaN NaN 0.000 0.000000 0.0 0.0 0.00 NaN NaN NaN 0.0 0.000000 0.0 0.0 0.00 0.000000 0.0 0.0 0.000 0.000000 0.000 0.0 0.000 0.000000 0.000 0.0 0.0 NaN 0 0.000000 0 NaN NaN 0.0 NaN NaN 0.0 0.000000 0.0 0.0 0 0.0 0 0 0.0 0 8 1.000000 1 1 0.0 0.00 0.0 0.000000 0.000000 NaN
4 100036 0 6 1 -90 -7.5 0.000 0.000000 0.0 0.0 945000 78750.000000 0.0 NaN NaN NaN 0.0 0.000000 0.0 0.0 0.0 NaN NaN NaN 0.0 NaN NaN NaN 0.000 0.000000 0.0 0.0 0.00 NaN NaN NaN 0.0 0.000000 0.0 0.0 0.00 0.000000 0.0 0.0 0.000 0.000000 0.000 0.0 0.000 0.000000 0.000 0.0 0.0 NaN 0 0.000000 0 NaN NaN 0.0 NaN NaN 0.0 0.000000 0.0 0.0 0 0.0 0 0 0.0 0 12 1.000000 1 1 0.0 0.00 0.0 0.000000 0.000000 NaN
In [136]:
app_train_numeric = app_train_cc_merged[['CC_CNT_DRAWINGS_ATM_CURRENT_MEAN', 'CC_CNT_DRAWINGS_CURRENT_MAX','CC_CREDIT_LOAD','CC_AMT_BALANCE_MEAN','CC_AMT_TOTAL_RECEIVABLE_MEAN','AMT_INST_MIN_REGULARITY']]
numeric_index = app_train_numeric.isna().sum()[app_train_numeric.isna().sum()/len(app_train_numeric) <0.5].index[:5]
app_train_numeric =  app_train_numeric[numeric_index]

app_train_numeric['TARGET'] = app_train_cc_merged['TARGET']
g = app_train_numeric.groupby('TARGET')
app_train_numeric = g.apply(lambda x: x.sample(g.size().min()).reset_index(drop=True))
sns.pairplot(app_train_numeric, hue= 'TARGET')
/usr/local/lib/python3.6/site-packages/numpy/core/_methods.py:193: RuntimeWarning: invalid value encountered in subtract
  x = asanyarray(arr - arrmean)
/usr/local/lib/python3.6/site-packages/seaborn/distributions.py:288: UserWarning: Data must have variance to compute a kernel density estimate.
  warnings.warn(msg, UserWarning)
/usr/local/lib/python3.6/site-packages/numpy/core/_methods.py:193: RuntimeWarning: invalid value encountered in subtract
  x = asanyarray(arr - arrmean)
/usr/local/lib/python3.6/site-packages/seaborn/distributions.py:288: UserWarning: Data must have variance to compute a kernel density estimate.
  warnings.warn(msg, UserWarning)
Out[136]:
<seaborn.axisgrid.PairGrid at 0x7fd13bd720f0>

POS data

POS_CASH_balance Data Descritpion

  • Monthly balance snapshots of previous POS (point of sales) and cash loans that the applicant had with Home Credit.
  • This table has one row for each month of history of every previous credit in Home Credit (consumer credit and cash loans) related to loans in our sample – i.e. the table has
    • Number loans in sample *
    • Number of relative previous credits *
    • Number of months in which we have some history observable for the previous credits
    • Term of previous credit
    • Installment left to pay
    • Contract stauts in given month
In [43]:
file_list = [ 'application_train.csv', 'POS_CASH_balance.csv']
app_train, pos_cash = read_files(file_list, path, print_details=False)
1it [00:04,  4.22s/it]
0it [00:00, ?it/s]
application_train.csv ... Read Completed
****************************************
21it [00:11,  1.86it/s]
POS_CASH_balance.csv ... Read Completed
****************************************

POS data EDA

In [44]:
pos_cash=reduce_memory(pos_cash)
memory usage is  610.0 MB
end memory usage is  172.0 MB
In [45]:
pos_cash.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10001358 entries, 0 to 10001357
Data columns (total 8 columns):
 #   Column                 Dtype   
---  ------                 -----   
 0   SK_ID_PREV             int32   
 1   SK_ID_CURR             int32   
 2   MONTHS_BALANCE         int8    
 3   CNT_INSTALMENT         float16 
 4   CNT_INSTALMENT_FUTURE  float16 
 5   NAME_CONTRACT_STATUS   category
 6   SK_DPD                 int16   
 7   SK_DPD_DEF             int16   
dtypes: category(1), float16(2), int16(2), int32(2), int8(1)
memory usage: 171.7 MB
In [46]:
# Check for null values
missingdata(pos_cash,0)

Number of columns with over 0 percent missing values 
-------------------
2
Out[46]:
Total Percent
CNT_INSTALMENT_FUTURE 26087 0.260835
CNT_INSTALMENT 26071 0.260675
In [47]:
pos_cash.head()
Out[47]:
SK_ID_PREV SK_ID_CURR MONTHS_BALANCE CNT_INSTALMENT CNT_INSTALMENT_FUTURE NAME_CONTRACT_STATUS SK_DPD SK_DPD_DEF
0 1803195 182943 -31 48.0 45.0 Active 0 0
1 1715348 367990 -33 36.0 35.0 Active 0 0
2 1784872 397406 -32 12.0 9.0 Active 0 0
3 1903291 269225 -35 48.0 42.0 Active 0 0
4 2341044 334279 -35 36.0 35.0 Active 0 0
In [48]:
pos_cash.describe()
Out[48]:
SK_ID_PREV SK_ID_CURR MONTHS_BALANCE CNT_INSTALMENT CNT_INSTALMENT_FUTURE SK_DPD SK_DPD_DEF
count 1.000136e+07 1.000136e+07 1.000136e+07 9975287.0 9975271.0 1.000136e+07 1.000136e+07
mean 1.903217e+06 2.784039e+05 -3.501259e+01 NaN NaN 1.160693e+01 6.544684e-01
std 5.358465e+05 1.027637e+05 2.606657e+01 0.0 0.0 1.327140e+02 3.276249e+01
min 1.000001e+06 1.000010e+05 -9.600000e+01 1.0 0.0 0.000000e+00 0.000000e+00
25% 1.434405e+06 1.895500e+05 -5.400000e+01 10.0 3.0 0.000000e+00 0.000000e+00
50% 1.896565e+06 2.786540e+05 -2.800000e+01 12.0 7.0 0.000000e+00 0.000000e+00
75% 2.368963e+06 3.674290e+05 -1.300000e+01 24.0 14.0 0.000000e+00 0.000000e+00
max 2.843499e+06 4.562550e+05 -1.000000e+00 92.0 85.0 4.231000e+03 3.595000e+03

CNT_INSTALMENT: Term of previous credit (can change over time)

In [49]:
pos_cash['CNT_INSTALMENT'].median()

pos_cash['CNT_INSTALMENT'].plot.hist()
plt.title("Distribution of CNT_Instalment")
Out[49]:
Text(0.5, 1.0, 'Distribution of CNT_Instalment')

CNT_INSTALMENT_FUTURE: Installments left to pay on the previous credit

In [50]:
pos_cash['CNT_INSTALMENT_FUTURE'].plot.hist()
plt.title("Distribution of CNT_Instalment_Furture")
Out[50]:
Text(0.5, 1.0, 'Distribution of CNT_Instalment_Furture')
In [51]:
pos_cash['CNT_INSTALMENT_FUTURE'].median()
Out[51]:
7.0

Months Balance: Month of balance relative to application date

(-1 means the information to the freshest monthly snapshot, 0 means the information at application - often it will be the same as -1 as many banks are not updating the information to Credit Bureau regularly )

In [52]:
pos_cash['MONTHS_BALANCE'].plot.hist()
plt.title("Distribution of MONTHS_BALANCE")
Out[52]:
Text(0.5, 1.0, 'Distribution of MONTHS_BALANCE')

Overall view of Data Distribution

In [53]:
corr_matrix = pos_cash.corr()
In [54]:
corr_matrix
Out[54]:
SK_ID_PREV SK_ID_CURR MONTHS_BALANCE CNT_INSTALMENT CNT_INSTALMENT_FUTURE SK_DPD SK_DPD_DEF
SK_ID_PREV 1.000000 -0.000336 0.001835 0.003820 0.003679 -0.000487 0.004848
SK_ID_CURR -0.000336 1.000000 0.000404 0.000144 -0.000559 0.003118 0.001948
MONTHS_BALANCE 0.001835 0.000404 1.000000 0.336163 0.271595 -0.018939 -0.000381
CNT_INSTALMENT 0.003820 0.000144 0.336163 1.000000 0.871276 -0.060803 -0.014154
CNT_INSTALMENT_FUTURE 0.003679 -0.000559 0.271595 0.871276 1.000000 -0.082004 -0.017436
SK_DPD -0.000487 0.003118 -0.018939 -0.060803 -0.082004 1.000000 0.245782
SK_DPD_DEF 0.004848 0.001948 -0.000381 -0.014154 -0.017436 0.245782 1.000000
In [55]:
plt.subplots(figsize = (15,15))
sns.heatmap(pos_cash.corr(), cmap = 'viridis')
Out[55]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fd14ec5e7b8>

POS data Fearure Enginering

In this section, we are looking at ways we could combine or eliminate features that could be added to the application train dataset for modelling

Few questions that we can evaluate

  • Number of active contracts with respect to customer
  • Ratio of current installment to months balance
  • Ratio of future to current installment
  • Remaining installments
In [57]:
pos_cash[(pos_cash['SK_ID_CURR'] == 100001) ].sort_values(by=['SK_ID_PREV', 'MONTHS_BALANCE'])
Out[57]:
SK_ID_PREV SK_ID_CURR MONTHS_BALANCE CNT_INSTALMENT CNT_INSTALMENT_FUTURE NAME_CONTRACT_STATUS SK_DPD SK_DPD_DEF
7167007 1369693 100001 -57 4.0 4.0 Active 0 0
8789081 1369693 100001 -56 4.0 3.0 Active 0 0
7823681 1369693 100001 -55 4.0 2.0 Active 0 0
4704415 1369693 100001 -54 4.0 1.0 Active 0 0
2197888 1369693 100001 -53 4.0 0.0 Completed 0 0
1261679 1851984 100001 -96 4.0 2.0 Active 0 0
1891462 1851984 100001 -95 4.0 1.0 Active 7 7
8531326 1851984 100001 -94 4.0 0.0 Active 0 0
4928574 1851984 100001 -93 4.0 0.0 Completed 0 0
In [58]:
pos_cash['contract_status_binary'] = pos_cash['NAME_CONTRACT_STATUS'].map({'Active' : 0, 'Completed' : 1})
In [59]:
loan_status = pos_cash.groupby(['SK_ID_CURR', 'SK_ID_PREV'])['contract_status_binary'].sum().reset_index()
loan_status = loan_status.groupby('SK_ID_CURR')['contract_status_binary'].mean().reset_index()
loan_status.head(15)
Out[59]:
SK_ID_CURR contract_status_binary
0 100001 1.000000
1 100002 0.000000
2 100003 0.666667
3 100004 1.000000
4 100005 1.000000
5 100006 0.666667
6 100007 0.600000
7 100008 1.000000
8 100009 0.875000
9 100010 1.000000
10 100011 1.000000
11 100012 1.000000
12 100013 1.000000
13 100014 0.500000
14 100015 1.000000
In [60]:
act_contract_counts = (pos_cash.groupby('SK_ID_CURR')['contract_status_binary'].sum() / pos_cash.groupby('SK_ID_CURR')['contract_status_binary'].count()).reset_index()
act_contract_counts.columns = ['SK_ID_CURR', 'ACT_CONTRACTS']
act_contract_counts.head()
Out[60]:
SK_ID_CURR ACT_CONTRACTS
0 100001 0.222222
1 100002 0.000000
2 100003 0.071429
3 100004 0.250000
4 100005 0.100000
In [61]:
sns.kdeplot(act_contract_counts['ACT_CONTRACTS'])
Out[61]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fd14eaac0b8>
In [62]:
curt_instal_mont_ratio = pos_cash.groupby('SK_ID_CURR', as_index=False)['CNT_INSTALMENT', 'MONTHS_BALANCE'].sum()
curt_instal_mont_ratio['Cur_Install_Mon_Ratio'] = curt_instal_mont_ratio['CNT_INSTALMENT'] / curt_instal_mont_ratio['MONTHS_BALANCE']
curt_instal_mont_ratio.drop(['CNT_INSTALMENT', 'MONTHS_BALANCE'], axis =1, inplace=True)
curt_instal_mont_ratio.head()
/usr/local/lib/python3.6/site-packages/ipykernel_launcher.py:1: FutureWarning: Indexing with multiple keys (implicitly converted to a tuple of keys) will be deprecated, use a list instead.
  """Entry point for launching an IPython kernel.
Out[62]:
SK_ID_CURR Cur_Install_Mon_Ratio
0 100001 -0.055130
1 100002 -2.400000
2 100003 -0.230832
3 100004 -0.147059
4 100005 -0.531818
In [63]:
sns.kdeplot(curt_instal_mont_ratio['Cur_Install_Mon_Ratio'])
Out[63]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fd14ea2fda0>
In [64]:
fut_instal_mont_ratio = pos_cash.groupby('SK_ID_CURR', as_index=False)['CNT_INSTALMENT_FUTURE', 'MONTHS_BALANCE'].sum()
fut_instal_mont_ratio['Fut_Install_Mon_Ratio'] = fut_instal_mont_ratio['CNT_INSTALMENT_FUTURE'] / fut_instal_mont_ratio['MONTHS_BALANCE']
fut_instal_mont_ratio.drop(['CNT_INSTALMENT_FUTURE', 'MONTHS_BALANCE'], axis =1, inplace=True)
fut_instal_mont_ratio.head()
/usr/local/lib/python3.6/site-packages/ipykernel_launcher.py:1: FutureWarning: Indexing with multiple keys (implicitly converted to a tuple of keys) will be deprecated, use a list instead.
  """Entry point for launching an IPython kernel.
Out[64]:
SK_ID_CURR Fut_Install_Mon_Ratio
0 100001 -0.019908
1 100002 -1.500000
2 100003 -0.132137
3 100004 -0.088235
4 100005 -0.327273
In [65]:
features = pd.DataFrame({'SK_ID_CURR': pos_cash['SK_ID_CURR'].unique()})
In [66]:
pos_cash_sorted = pos_cash.sort_values(['SK_ID_CURR', 'MONTHS_BALANCE'])
group_object = pos_cash_sorted.groupby('SK_ID_CURR')['CNT_INSTALMENT_FUTURE'].last().reset_index()
group_object.rename(index=str,
                   columns={'CNT_INSTALMENT_FUTURE': 'REM_INSTALMENT'},
                    inplace=True)
group_object.head()
Out[66]:
SK_ID_CURR REM_INSTALMENT
0 100001 0.0
1 100002 6.0
2 100003 0.0
3 100004 0.0
4 100005 0.0
In [67]:
features[features['SK_ID_CURR'] == 100002]
Out[67]:
SK_ID_CURR
206934 100002
In [68]:
pos_cash_sorted['REM_INSTALMENT'] = (pos_cash_sorted['SK_DPD'] > 0).astype(int)
pos_cash_sorted['REM_INSTALMENT_TOL'] = (pos_cash_sorted['SK_DPD_DEF'] > 0).astype(int)
groupby = pos_cash_sorted.groupby(['SK_ID_CURR'])['REM_INSTALMENT', 'REM_INSTALMENT_TOL'].sum().reset_index()
groupby.head()
/usr/local/lib/python3.6/site-packages/ipykernel_launcher.py:3: FutureWarning: Indexing with multiple keys (implicitly converted to a tuple of keys) will be deprecated, use a list instead.
  This is separate from the ipykernel package so we can avoid doing imports until
Out[68]:
SK_ID_CURR REM_INSTALMENT REM_INSTALMENT_TOL
0 100001 1 1
1 100002 0 0
2 100003 0 0
3 100004 0 0
4 100005 0 0

Generating POS data features

In [69]:
def pos_data_generator(pos_cash):
    #pos_cash = read_files(['POS_CASH_balance.csv'], path, print_details=False)[0]
    # For every unique ID we want number of active contracts
    pos_cash['contract_status_binary'] = pos_cash['NAME_CONTRACT_STATUS'].map({'Active' : 0, 'Completed' : 1})
    pos = pos_cash.groupby(['SK_ID_CURR', 'SK_ID_PREV'])['contract_status_binary'].sum().reset_index()
    pos = pos.groupby('SK_ID_CURR')['contract_status_binary'].mean().reset_index()
    #pos = (pos_cash.groupby('SK_ID_CURR')['contract_status_binary'].sum() / pos_cash.groupby('SK_ID_CURR')['contract_status_binary'].count()).reset_index()
    pos.columns = ['SK_ID_CURR', 'ACT_CONTRACTS']
    
    # For this feature we can see the ration of Current Installment to the months ratio
    grp = pos_cash.groupby('SK_ID_CURR', as_index=False)['CNT_INSTALMENT', 'MONTHS_BALANCE'].sum()
    grp['Cur_Install_Mon_Ratio'] = grp['CNT_INSTALMENT'] / grp['MONTHS_BALANCE']
    grp.drop(['CNT_INSTALMENT', 'MONTHS_BALANCE'], axis =1, inplace=True)
    pos=pos.merge(grp,on='SK_ID_CURR',how='left')
    # For this feature we can see the ration of furture Installment to the current installment
    grp = pos_cash.groupby('SK_ID_CURR', as_index=False)['CNT_INSTALMENT_FUTURE', 'MONTHS_BALANCE'].sum()
    grp['Fut_Install_Mon_Ratio'] = grp['CNT_INSTALMENT_FUTURE'] / grp['MONTHS_BALANCE']
    grp.drop(['CNT_INSTALMENT_FUTURE', 'MONTHS_BALANCE'], axis =1, inplace=True)
    pos=pos.merge(grp,on='SK_ID_CURR',how='left')
    
    # For each loan what is the number of months instalmment pending
    pos_cash_sorted = pos_cash.sort_values(['SK_ID_CURR', 'MONTHS_BALANCE'])
    grp = pos_cash_sorted.groupby('SK_ID_CURR')['CNT_INSTALMENT_FUTURE'].last().reset_index()
    grp.rename(index=str,columns={'CNT_INSTALMENT_FUTURE': 'REM_INSTALMENT'},inplace=True)
    pos=pos.merge(grp,on='SK_ID_CURR',how='left')

    # In past for each unique ID, how many times have we missed the due late and missed with tolerance
    pos_cash_sorted['PAID_LATE'] = (pos_cash_sorted['SK_DPD'] > 0).astype(int)
    pos_cash_sorted['PAID_LATE_TOL'] = (pos_cash_sorted['SK_DPD_DEF'] > 0).astype(int)
    grp = pos_cash_sorted.groupby(['SK_ID_CURR'])['PAID_LATE', 'PAID_LATE_TOL'].mean().reset_index()
    pos=pos.merge(grp,on='SK_ID_CURR',how='left')

    pos.columns = ['POS_' + e if e != 'SK_ID_CURR' else e for e in pos.columns.to_list()]

    return pos

In [70]:
df_pos_cash=pos_data_generator(pos_cash)
/usr/local/lib/python3.6/site-packages/ipykernel_launcher.py:11: FutureWarning: Indexing with multiple keys (implicitly converted to a tuple of keys) will be deprecated, use a list instead.
  # This is added back by InteractiveShellApp.init_path()
/usr/local/lib/python3.6/site-packages/ipykernel_launcher.py:16: FutureWarning: Indexing with multiple keys (implicitly converted to a tuple of keys) will be deprecated, use a list instead.
  app.launch_new_instance()
/usr/local/lib/python3.6/site-packages/ipykernel_launcher.py:30: FutureWarning: Indexing with multiple keys (implicitly converted to a tuple of keys) will be deprecated, use a list instead.
In [71]:
df_pos_cash.head()
Out[71]:
SK_ID_CURR POS_ACT_CONTRACTS POS_Cur_Install_Mon_Ratio POS_Fut_Install_Mon_Ratio POS_REM_INSTALMENT POS_PAID_LATE POS_PAID_LATE_TOL
0 100001 1.000000 -0.055130 -0.019908 0.0 0.111111 0.111111
1 100002 0.000000 -2.400000 -1.500000 6.0 0.000000 0.000000
2 100003 0.666667 -0.230832 -0.132137 0.0 0.000000 0.000000
3 100004 1.000000 -0.147059 -0.088235 0.0 0.000000 0.000000
4 100005 1.000000 -0.531818 -0.327273 0.0 0.000000 0.000000
In [72]:
df_pos_cash.describe()
Out[72]:
SK_ID_CURR POS_ACT_CONTRACTS POS_Cur_Install_Mon_Ratio POS_Fut_Install_Mon_Ratio POS_REM_INSTALMENT POS_PAID_LATE POS_PAID_LATE_TOL
count 337252.000000 337252.000000 337252.000000 337252.000000 337224.0 337252.000000 337252.000000
mean 278163.132678 0.789801 -0.783274 -0.523549 NaN 0.021046 0.009903
std 102877.889290 1.445089 0.940650 0.811706 0.0 0.078553 0.039230
min 100001.000000 0.000000 -60.000000 -60.000000 0.0 0.000000 0.000000
25% 189046.750000 0.500000 -0.966176 -0.608993 0.0 0.000000 0.000000
50% 278241.500000 0.800000 -0.502523 -0.297758 0.0 0.000000 0.000000
75% 367320.250000 1.000000 -0.256618 -0.137520 6.0 0.000000 0.000000
max 456255.000000 86.000000 -0.000000 -0.000000 80.0 1.000000 1.000000
In [73]:
#df.to_csv('/content/drive/My Drive/I526 Final Project/data/processed/pos_processed.csv',index=False)
df_pos_cash.to_csv('../Project/data/processed//pos_processed.csv',index=False)
In [74]:
df_pos_cash.columns.tolist()
Out[74]:
['SK_ID_CURR',
 'POS_ACT_CONTRACTS',
 'POS_Cur_Install_Mon_Ratio',
 'POS_Fut_Install_Mon_Ratio',
 'POS_REM_INSTALMENT',
 'POS_PAID_LATE',
 'POS_PAID_LATE_TOL']
In [75]:
app_train_i=app_train[['SK_ID_CURR', 'TARGET']]
In [76]:
app_train_df_pos_cash_merged = app_train_i.merge(df_pos_cash.reset_index(), 
                                            left_on='SK_ID_CURR', right_on='SK_ID_CURR', 
                                            how='left', validate='one_to_one')
In [77]:
corr_matrix = app_train_df_pos_cash_merged.corr()
In [78]:
corr_matrix["TARGET"].sort_values(ascending=False)
Out[78]:
TARGET                       1.000000
POS_PAID_LATE_TOL            0.047050
POS_PAID_LATE                0.030616
POS_REM_INSTALMENT           0.009208
POS_ACT_CONTRACTS            0.000941
SK_ID_CURR                  -0.002108
index                       -0.002137
POS_Fut_Install_Mon_Ratio   -0.034326
POS_Cur_Install_Mon_Ratio   -0.036532
Name: TARGET, dtype: float64
In [79]:
app_train_df_pos_cash_merged.head()
Out[79]:
SK_ID_CURR TARGET index POS_ACT_CONTRACTS POS_Cur_Install_Mon_Ratio POS_Fut_Install_Mon_Ratio POS_REM_INSTALMENT POS_PAID_LATE POS_PAID_LATE_TOL
0 100002 1 1.0 0.000000 -2.400000 -1.500000 6.0 0.0 0.0
1 100003 0 2.0 0.666667 -0.230832 -0.132137 0.0 0.0 0.0
2 100004 0 3.0 1.000000 -0.147059 -0.088235 0.0 0.0 0.0
3 100006 0 5.0 0.666667 -1.188119 -0.856436 3.0 0.0 0.0
4 100007 0 6.0 0.600000 -0.455856 -0.266667 13.0 0.0 0.0
In [83]:
app_train_numeric = app_train_df_pos_cash_merged
numeric_index = app_train_numeric.isna().sum()[app_train_numeric.isna().sum()/len(app_train_numeric) <0.5].index[:6]
app_train_numeric =  app_train_numeric[numeric_index]

app_train_numeric['TARGET'] = app_train_df_pos_cash_merged['TARGET']
g = app_train_numeric.groupby('TARGET')
app_train_numeric = g.apply(lambda x: x.sample(g.size().min()).reset_index(drop=True))
sns.pairplot(app_train_numeric, hue= 'TARGET')
/usr/local/lib/python3.6/site-packages/ipykernel_launcher.py:5: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """
Out[83]:
<seaborn.axisgrid.PairGrid at 0x7fd136a677b8>

At the end of PART 1, we exported processed supplementary files. This files will be directly merged with application train and test file using first step in Pipeline. Preprocessing pipelines is explained in PART 2.

PART 2 - PRE-PROCESSING PIPELINES

image.png

Highlight of Pipeline is shown in above image. More details on coding and pre-processing steps can be found below. Preprocessed files are exported at the end of this pipeline. This pre-processed data is used for training machine learning model in PART 3


Reading Files

In [1]:
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
In [3]:
## Load necessary packages to building pipelines

import json

from sklearn.model_selection import GridSearchCV, train_test_split

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from lightgbm import LGBMClassifier


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.metrics import accuracy_score, f1_score, roc_curve, auc, \
roc_auc_score, confusion_matrix, plot_roc_curve, plot_confusion_matrix

from sklearn.base import BaseEstimator, TransformerMixin

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.experimental import enable_iterative_imputer  
from sklearn.impute import SimpleImputer, IterativeImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)

from tqdm import tqdm
from functools import reduce
In [3]:
# Below funciton is to read data files in chucks and append the list of files

path = "/content/drive/My Drive/I526 Final Project/data/"

def read_files(list_of_files, path, print_details = True):
    df_list = []
    for file in list_of_files:
        chunksize = 500000
        filename = path + file
        
        i = 1
        for chunk in tqdm(pd.read_csv(filename, chunksize=chunksize, low_memory=False)):
            df = chunk if i == 1 else pd.concat([df, chunk])
            if print_details:
                print('-->Read Chunk...', i)
            i += 1
        df_list.append(df)
        print(file + " ... Read Completed")
        print('*'*40)
    return df_list
In [4]:
## Reading Application train and test file. 
file_list = ['application_test.csv','application_train.csv']
#Read the data into two files - Application test and Application and train dataset
app_test, app_train = read_files(file_list, path, print_details=False)
1it [00:01,  1.22s/it]
application_test.csv ... Read Completed
****************************************
1it [00:07,  7.97s/it]
application_train.csv ... Read Completed
****************************************


In order to reduce runtime for hyper parameter tuning, we randomly sampled 50% of the data from Kaggle Training Dataset. We have also stratified our target variable. However, at end we have trained full Kaggle training set to generate final model.

In [5]:
#Drop Target from Train Test dataset
dfx = app_train.drop('TARGET', axis =1)
dfy = app_train['TARGET']

X_train, _, y_train, _ = train_test_split(dfx, dfy, train_size = 0.5, stratify = dfy, random_state = 32)

#Merge both X_train and Y-train datasets
df_sample = pd.concat([X_train, y_train], axis =1).reset_index(drop= True)
df_sample.head()
Out[5]:
SK_ID_CURR NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE NAME_TYPE_SUITE NAME_INCOME_TYPE NAME_EDUCATION_TYPE NAME_FAMILY_STATUS NAME_HOUSING_TYPE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH OWN_CAR_AGE FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL OCCUPATION_TYPE CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY ORGANIZATION_TYPE EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG YEARS_BUILD_AVG COMMONAREA_AVG ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG FLOORSMIN_AVG LANDAREA_AVG LIVINGAPARTMENTS_AVG LIVINGAREA_AVG NONLIVINGAPARTMENTS_AVG NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_MODE COMMONAREA_MODE ELEVATORS_MODE ENTRANCES_MODE FLOORSMAX_MODE FLOORSMIN_MODE LANDAREA_MODE LIVINGAPARTMENTS_MODE LIVINGAREA_MODE NONLIVINGAPARTMENTS_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI YEARS_BUILD_MEDI COMMONAREA_MEDI ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI FLOORSMIN_MEDI LANDAREA_MEDI LIVINGAPARTMENTS_MEDI LIVINGAREA_MEDI NONLIVINGAPARTMENTS_MEDI NONLIVINGAREA_MEDI FONDKAPREMONT_MODE HOUSETYPE_MODE TOTALAREA_MODE WALLSMATERIAL_MODE EMERGENCYSTATE_MODE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR TARGET
0 362521 Cash loans M Y Y 0 135000.0 539100.0 27652.5 450000.0 Family Commercial associate Higher education Married House / apartment 0.011703 -10386 -1063 -185.0 -3079 10.0 1 1 0 1 0 0 Drivers 2.0 2 2 THURSDAY 12 0 0 0 0 0 0 Other 0.221539 0.617474 0.315472 0.0021 0.0000 0.9717 0.6124 0.0000 0.00 0.0690 0.0000 0.0417 0.0000 0.0017 0.0021 0.0 0.0 0.0021 0.0000 0.9717 0.6276 0.0000 0.0000 0.0690 0.0000 0.0417 0.0000 0.0018 0.0022 0.0 0.0 0.0021 0.0000 0.9717 0.6176 0.0000 0.00 0.0690 0.0000 0.0417 0.0000 0.0017 0.0022 0.0 0.0 reg oper account terraced house 0.0017 Stone, brick No 0.0 0.0 0.0 0.0 -628.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 1.0 0.0 0
1 426831 Cash loans F N Y 0 112500.0 371862.0 29511.0 328500.0 Family State servant Secondary / secondary special Single / not married House / apartment 0.003122 -18862 -1458 -242.0 -2413 NaN 1 1 0 1 0 0 Security staff 1.0 3 3 SUNDAY 10 0 0 0 0 0 0 Medicine NaN 0.618197 NaN 0.0619 0.0503 0.9816 0.7484 0.0080 0.00 0.1379 0.1667 0.2083 0.0481 0.0504 0.0540 0.0 0.0 0.0630 0.0497 0.9816 0.7583 0.0080 0.0000 0.1379 0.1667 0.2083 0.0487 0.0551 0.0561 0.0 0.0 0.0625 0.0503 0.9816 0.7518 0.0080 0.00 0.1379 0.1667 0.2083 0.0490 0.0513 0.0549 0.0 0.0 reg oper account block of flats 0.0467 Panel No 1.0 0.0 1.0 0.0 -739.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN 1
2 385997 Cash loans F Y Y 0 90000.0 328405.5 22986.0 283500.0 Unaccompanied Pensioner Lower secondary Married House / apartment 0.010500 -22686 365243 -9788.0 -4157 19.0 1 0 0 1 0 0 NaN 2.0 3 3 WEDNESDAY 9 0 0 0 0 0 0 XNA NaN 0.219295 0.673830 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2.0 0.0 2.0 0.0 -1000.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 1.0 0.0 0.0 2.0 0
3 399257 Cash loans F Y Y 2 90000.0 509400.0 34173.0 450000.0 Unaccompanied Working Higher education Married House / apartment 0.018209 -11990 -5051 -4311.0 -3675 15.0 1 1 1 1 0 0 Accountants 4.0 3 3 WEDNESDAY 21 0 0 0 0 0 0 Business Entity Type 3 0.394391 0.400263 0.726711 0.2196 0.1192 0.9811 0.7416 0.0394 0.24 0.2069 0.3333 0.3750 0.1156 0.1790 0.2172 0.0 0.0 0.2237 0.1237 0.9811 0.7517 0.0398 0.2417 0.2069 0.3333 0.3750 0.1182 0.1956 0.2263 0.0 0.0 0.2217 0.1192 0.9811 0.7451 0.0397 0.24 0.2069 0.3333 0.3750 0.1176 0.1821 0.2211 0.0 0.0 reg oper account block of flats 0.1924 Stone, brick No 5.0 1.0 5.0 0.0 -1515.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 1.0 1.0 0
4 255429 Cash loans F N N 0 360000.0 1494486.0 48208.5 1305000.0 Unaccompanied Working Higher education Married House / apartment 0.001276 -21606 -1817 -8499.0 -4347 NaN 1 1 1 1 1 0 NaN 2.0 2 2 FRIDAY 15 0 0 0 0 0 0 Business Entity Type 3 NaN 0.582560 0.729567 0.0749 0.0778 0.9876 0.8300 0.1285 0.00 0.1607 0.1667 0.0971 0.0538 0.0611 0.0733 0.0 0.0 0.0305 0.0320 0.9876 0.8367 0.0074 0.0000 0.0345 0.1667 0.0417 0.0207 0.0266 0.0261 0.0 0.0 0.0906 0.0934 0.9876 0.8323 0.1769 0.00 0.2069 0.1667 0.0417 0.0594 0.0744 0.0921 0.0 0.0 reg oper spec account block of flats 0.0265 Panel No 0.0 0.0 0.0 0.0 -1250.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 2.0 0

Pre-Processing Pipeline Steps

Below are the customized pipeline classes developed that were used to create new features, merging data from supplementry files and reduce dimensionality.

This will further feed into estimator pipeline.

  • Feature Adder - Based on the features, we further conduct analysis to understand what factors are critical and see if they could be combined or deleted
  • Dimensionality Reduction - We used techniques such as correlation analysis to remove similar features from the combined dataset
  • Custom Imputer - We have three different types of imputer for categorical, nominal and continuous variables
  • Binary labeler - Convert categorical columns into binary encoded (One hot encoding)

Upon completion of the above activities data was exported as clean CSV file which is used to perform model evaluation

PipeLine Step 1 : Supplementery Files Adder

There are six different sources or supplementry files that were provided within Kaggle dataset.

  • bureau.csv
  • bureau_balance.csv
  • POS_CASH_balance.csv
  • credit_card_balance.csv
  • previous_application.csv
  • installments_payments.csv

We wanted to build on our baseline model by adding features from supplementary files that were provided on kaggle. For each supplementary file we conducted all the steps of data pre-processing including EDA, feature engineering and data imputing. The results of feature engineering were processed into a separate CSV called processed __. csv files. In the below function we merge all the processed csv file into one csv for further analysis.

Processing for Bureau and Bureau Balance files was conducted by Jugal, Previous application and installments payment was conducted by Deepak, Point of Sales Cash file by Gautham and credit card balance file by Andrew.

In [6]:
class Supp_Info_Adder(BaseEstimator, TransformerMixin):
    def __init__(self): # no *args or **kargs
        pass
    def fit(self, X, y=None):
        return self  # nothing else to do

# All the processed files were merged together
    def transform(self, X, y=None):
#             df = X.copy()
            dfs = read_files(['bureau_processed.csv', 'pos_processed.csv', 'cc_processed.csv',
                                           'previous_application_processed.csv', 'installments_payments_processed.csv'],
                                           '/content/drive/My Drive/I526 Final Project/data/processed/', print_details=False)

            dfs.insert(0, X)
# Merge the data set by keeping the column SK_ID_CURR as primary key
            merge_df = reduce(lambda left,right: pd.merge(left,right,on='SK_ID_CURR', how = 'left'), dfs)

            print('supplemetery files merged to main train/test file............')

            return merge_df
# merge_df is the overall dataset
In [7]:
supp = Supp_Info_Adder()
temp = supp.transform(df_sample)
1it [00:02,  2.52s/it]
bureau_processed.csv ... Read Completed
****************************************
1it [00:01,  1.13s/it]
pos_processed.csv ... Read Completed
****************************************
1it [00:02,  2.01s/it]
cc_processed.csv ... Read Completed
****************************************
1it [00:01,  1.23s/it]
previous_application_processed.csv ... Read Completed
****************************************
1it [00:01,  1.03s/it]
installments_payments_processed.csv ... Read Completed
****************************************
supplemetery files merged to main train/test file............

PipeLine Step 2 : Feature Adder Class

As a part of Feature engineering each team member created functions that would extract added features that were created based on data pre -processing steps. Below functions combines all the modified features into one class called Feature_Adder

In [8]:
class Feature_Adder(TransformerMixin, BaseEstimator):
    '''A template for a custom transformer.'''

    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X):
#         X = X.copy()
        
        X['DAYS_EMPLOYED'].replace(365243, np.nan, inplace= True)
        X['CODE_GENDER'].replace({'XNA': np.nan}, inplace=True)
        X['NAME_INCOME_TYPE'] = X['NAME_INCOME_TYPE'].map(lambda x: x if x != 'Maternity leave' else np.nan)
        X['NAME_FAMILY_STATUS'] = X['NAME_FAMILY_STATUS'].map(lambda x: x if x != 'Unknown' else np.nan)
        
        building_columns = ['APARTMENTS_AVG', 'BASEMENTAREA_AVG', 'YEARS_BEGINEXPLUATATION_AVG',
       'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'ELEVATORS_AVG', 'ENTRANCES_AVG',
       'FLOORSMAX_AVG', 'FLOORSMIN_AVG', 'LANDAREA_AVG',
       'LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG', 'NONLIVINGAPARTMENTS_AVG',
       'NONLIVINGAREA_AVG']
    
        flag_columns = [_f for _f in X.columns if 'FLAG_DOC' in _f]
        live = [_f for _f in X.columns if ('FLAG_' in _f) & ('FLAG_DOC' not in _f) & ('_FLAG_' not in _f)]

        drop_flag_columns = ['FLAG_DOCUMENT_2','FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5',
                    'FLAG_DOCUMENT_6','FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8',
                    'FLAG_DOCUMENT_9','FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11',
                    'FLAG_DOCUMENT_12','FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14',
                    'FLAG_DOCUMENT_15','FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17',
                    'FLAG_DOCUMENT_18','FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20',
                    'FLAG_DOCUMENT_21']
    
        bureau_total_columns = ['AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK',
       'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT',
       'AMT_REQ_CREDIT_BUREAU_YEAR']
        
        X['Annuity_Income'] = X['AMT_ANNUITY']/X['AMT_INCOME_TOTAL']
        X['Income_Cred'] = X['AMT_CREDIT']/X['AMT_INCOME_TOTAL']
        X['Income_PP'] = X['AMT_INCOME_TOTAL']/X['CNT_FAM_MEMBERS']
        X['CHILDREN_RATIO'] = (1 + X['CNT_CHILDREN']) / X['CNT_FAM_MEMBERS']
        X['INCOME_PER_CHILD'] = X['AMT_INCOME_TOTAL']/(1 + X['CNT_CHILDREN'])
        X['PAYMENTS'] = X['AMT_ANNUITY']/ X['AMT_CREDIT']
        X['NEW_CREDIT_TO_GOODS_RATIO'] = X['AMT_CREDIT'] / X['AMT_GOODS_PRICE']
        X['GOODS_INCOME'] =  X['AMT_GOODS_PRICE']/X['AMT_INCOME_TOTAL']
        X['LOAN_ACESS'] = X['AMT_CREDIT'] - X['AMT_GOODS_PRICE']
        X['INCOME_TO_EMPLOYED_RATIO'] = X['AMT_INCOME_TOTAL'] / X['DAYS_EMPLOYED']
        X['INCOME_TO_BIRTH_RATIO'] = X['AMT_INCOME_TOTAL'] / X['DAYS_BIRTH']

        X['CNT_NON_CHILD'] = X['CNT_FAM_MEMBERS'] - X['CNT_CHILDREN']


        
        X['TERM'] = X['AMT_CREDIT'] / X['AMT_ANNUITY']
        X['MEAN_BUILDING_SCORE_AVG'] = X[building_columns].mean(skipna=True, axis=1)
        X['TOTAL_BUILDING_SCORE_AVG'] = X[building_columns].sum(skipna=True, axis=1)

        X['NEW_DOC_TOTAL'] = X[flag_columns].sum(axis=1)
        X['NEW_DOC_AVG'] = X[flag_columns].mean(axis=1)
        X['NEW_DOC_STD'] = X[flag_columns].std(axis=1)
        X['NEW_DOC_KURT'] = X[flag_columns].kurtosis(axis=1)
        X['NEW_LIVE_SUM'] = X_train[live].sum(axis=1)
        X['NEW_LIVE_STD'] = X[live].std(axis=1)
        X['NEW_LIVE_KURT'] = X[live].kurtosis(axis=1)

        X['EXTERNAL_SOURCE_WEIGHTED'] = X.EXT_SOURCE_1 * 2 + X.EXT_SOURCE_2 * 3 + X.EXT_SOURCE_3 * 4

        X['AMT_REQ_CREDIT_BUREAU_TOTAL'] = X[bureau_total_columns].sum(axis=1)
        X['AGE_RANGE'] = X['DAYS_BIRTH'].apply(lambda x: self._get_age_label(x))

        inc_by_org = X[['AMT_INCOME_TOTAL', 'ORGANIZATION_TYPE']].groupby('ORGANIZATION_TYPE').median()['AMT_INCOME_TOTAL']

        X['NEW_INC_BY_ORG'] = X['ORGANIZATION_TYPE'].map(inc_by_org)
        X['NEW_EXT_SOURCES_MEAN'] = X[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']].mean(axis=1)
        X['OWN_CAR_AGE'].fillna(0, inplace = True)
        
        # Bureau Columns missing data and new feature
        X['Bureau_No_Record'] = X['BUREAU_LOAN_COUNT'].apply(lambda x : 1 if x != x else 0 )
        b_columns = [_b for _b in X.columns if 'BUREAU_' in _b]

        # X[b_columns].fillna(0, inplace = True)
        X[b_columns] = X[b_columns].mask(X['Bureau_No_Record'] == 1, X[b_columns].fillna(0))
        
        # BUREAU_INCOME_CREDIT_RATIO : Ration of Average Credit Amount to the total income of family or household
        X['BUREAU_INCOME_CREDIT_RATIO'] = X['BUREAU_AVG_CREDIT_AMT'] / X['AMT_INCOME_TOTAL']


        # Prev App Columns missing data and new feature
        X['Prev_App_No_Record'] = X['PREV_APP_SK_ID_PREV_COUNT'].apply(lambda x : 1 if x != x else 0 )
        pa_columns = [_pa for _pa in X.columns if 'PREV_APP_' in _pa]

        # X[pa_columns].fillna(0, inplace = True)
        X[pa_columns] = X[pa_columns].mask(X['Prev_App_No_Record'] == 1, X[pa_columns].fillna(0))

        # Installment Pay Columns missing data and new feature
        X['INST_No_Record'] = X['INST_PAY_SK_ID_PREV_COUNT'].apply(lambda x : 1 if x != x else 0 )
        inst_columns = [_ins for _ins in X.columns if 'INST_PAY_' in _ins]

        # X[in_columns].fillna(0, inplace = True)
        X[inst_columns] = X[inst_columns].mask(X['INST_No_Record'] == 1, X[inst_columns].fillna(0))      

        # CreditCard Columns missing data and new feature
        X['CC_No_Record'] = X['CC_SK_ID_PREV_NUNIQUE'].apply(lambda x : 1 if x != x else 0 )
        cc_columns = [_cc for _cc in X.columns if 'CC_' in _cc]

        # X[in_columns].fillna(0, inplace = True)
        X[cc_columns] = X[cc_columns].mask(X['CC_No_Record'] == 1, X[cc_columns].fillna(0))

        
        # POS CASH Columns missing data and new feature
        X['POS_No_Record'] = X['POS_Cur_Install_Mon_Ratio'].apply(lambda x : 1 if x != x else 0 )
        pos_columns = [_p for _p in X.columns if 'POS__' in _p]

        # X[in_columns].fillna(0, inplace = True)
        X[pos_columns] = X[pos_columns].mask(X['POS_No_Record'] == 1, X[pos_columns].fillna(0))

        
       
        print("Feature Adder commpleted............")
        return X


    def _get_age_label(self, days_birth):
        #Return the age group label (int).
        age_years = -days_birth / 365
        if age_years < 27: return 1
        elif age_years < 40: return 2
        elif age_years < 50: return 3
        elif age_years < 65: return 4
        elif age_years < 99: return 5
        else: return 0

Correlation within the Added Features

After adding all features their correlation with respect to Target variable is analzed. Some features like External Source score and AGE Range are highly correlated.

Also, using Correlation heatmap we can see that several features are highly correlated and during dimensionality reduction steps these features needs to be dropped.

In [12]:
#Check how correlated the new features are with respect to the TARGET
FEA_N.corr().abs().sort_values('TARGET', ascending=False)['TARGET']
Out[12]:
TARGET                                            1.000000
EXTERNAL_SOURCE_WEIGHTED                          0.235866
NEW_EXT_SOURCES_MEAN                              0.222703
EXT_SOURCE_3                                      0.181003
EXT_SOURCE_2                                      0.158895
EXT_SOURCE_1                                      0.156085
BUREAU_AVG_DAYS_FROM_LAST_APPLICATION             0.082857
DAYS_BIRTH                                        0.077425
AGE_RANGE                                         0.074474
DAYS_EMPLOYED                                     0.074186
BUREAU_CREDIT_ENDDATE_PERCENTAGE                  0.072131
INST_PAY_LATE_INSTAL_RATIO_                       0.071697
NEW_CREDIT_TO_GOODS_RATIO                         0.068459
BUREAU_DAYS_CREDIT_UPDATE_MEAN                    0.067412
BUREAU_MONTHS_BALANCE_SIZE_MEAN                   0.067197
PREV_APP_NAME_CONTRACT_STATUS_REFUSED_RATIO_      0.066236
PREV_APP_NAME_CONTRACT_STATUS_Refused_SUM         0.062533
REGION_RATING_CLIENT_W_CITY                       0.062312
INST_PAY_AMT_PARTIAL_INSTAL_RATIO_                0.061698
CC_CNT_DRAWINGS_ATM_CURRENT_MEAN                  0.061409
REGION_RATING_CLIENT                              0.060836
BUREAU_MONTHS_BALANCE_MIN                         0.059479
CC_CNT_DRAWINGS_CURRENT_MAX                       0.055553
DAYS_LAST_PHONE_CHANGE                            0.054233
BUREAU_ACTIVE_LOANS_PERCENTAGE                    0.052760
REG_CITY_NOT_WORK_CITY                            0.051846
DAYS_ID_PUBLISH                                   0.050959
CC_AMT_BALANCE_MEAN                               0.050246
CC_AMT_TOTAL_RECEIVABLE_MEAN                      0.049949
CC_AMT_RECIVABLE_MEAN                             0.049941
CC_AMT_RECEIVABLE_PRINCIPAL_MEAN                  0.049782
TOTAL_BUILDING_SCORE_AVG                          0.047160
FLAG_DOCUMENT_3                                   0.046923
POS_PAID_LATE_TOL                                 0.046579
CC_CNT_DRAWINGS_POS_CURRENT_MAX                   0.046436
BUREAU_STATUS_MEAN                                0.045637
CC_CREDIT_LOAD_x                                  0.045494
CC_CREDIT_LOAD_y                                  0.045494
BUREAU_VARIANCE_DAYS_FROM_LAST_APPLICATION        0.045428
FLAG_EMP_PHONE                                    0.045231
FLOORSMAX_AVG                                     0.044229
CC_CNT_DRAWINGS_CURRENT_MEAN                      0.044129
FLOORSMAX_MEDI                                    0.043875
CC_AMT_INST_MIN_REGULARITY_MEAN                   0.043802
FLOORSMAX_MODE                                    0.042864
REG_CITY_NOT_LIVE_CITY                            0.042280
CC_AMT_DRAWINGS_ATM_CURRENT_MEAN                  0.042265
CC_PERCENTAGE_MISSED_PAYMENTS                     0.040570
CC_AMT_BALANCE_MAX                                0.040549
BUREAU_STATUS_MIN                                 0.040321
CC_AMT_TOTAL_RECEIVABLE_MAX                       0.040258
CC_AMT_RECIVABLE_MAX                              0.040252
DAYS_REGISTRATION                                 0.040141
CC_AMT_RECEIVABLE_PRINCIPAL_MAX                   0.039821
REGION_POPULATION_RELATIVE                        0.039328
CC_AMT_INST_MIN_REGULARITY_MAX                    0.038587
POS_Cur_Install_Mon_Ratio                         0.038054
NEW_DOC_KURT                                      0.037898
ELEVATORS_AVG                                     0.037345
BUREAU_AVG_COUNT_CONSUMER_LOAN                    0.037087
ELEVATORS_MEDI                                    0.037017
AMT_GOODS_PRICE                                   0.036971
CC_AMT_DRAWINGS_CURRENT_MEAN                      0.036767
LOAN_ACESS                                        0.036314
POS_Fut_Install_Mon_Ratio                         0.035725
PREV_APP_CNT_PAYMENT_MAX                          0.035436
LIVINGAREA_AVG                                    0.034632
LIVE_CITY_NOT_WORK_CITY                           0.034628
TOTALAREA_MODE                                    0.034585
ELEVATORS_MODE                                    0.034524
LIVINGAREA_MEDI                                   0.034294
CC_AMT_BALANCE_MIN                                0.034241
CC_CNT_DRAWINGS_POS_CURRENT_MEAN                  0.034037
CC_AMT_RECIVABLE_MIN                              0.033932
CC_AMT_TOTAL_RECEIVABLE_MIN                       0.033931
CC_AMT_RECEIVABLE_PRINCIPAL_MIN                   0.033763
CC_AMT_DRAWINGS_CURRENT_MAX                       0.033035
PREV_APP_NAME_CONTRACT_TYPE_NUNIQUE               0.032607
NEW_INC_BY_ORG                                    0.032488
APARTMENTS_AVG                                    0.032348
BUREAU_AVG_COUNT_RARE_LOAN                        0.032165
PREV_APP_CNT_PAYMENT_MEAN                         0.032112
LIVINGAREA_MODE                                   0.032091
DEF_30_CNT_SOCIAL_CIRCLE                          0.031887
APARTMENTS_MEDI                                   0.031616
INST_PAY_LATE_INSTAL_Binary_1_SUM                 0.031585
BUREAU_TOT_COUNT_RARE_LOAN                        0.031566
CC_AMT_DRAWINGS_ATM_CURRENT_MAX                   0.031124
FLOORSMIN_AVG                                     0.030901
DEF_60_CNT_SOCIAL_CIRCLE                          0.030827
FLOORSMIN_MEDI                                    0.030725
TERM                                              0.030580
FLOORSMIN_MODE                                    0.029856
POS_PAID_LATE                                     0.029745
CC_CNT_DRAWINGS_CURRENT_SUM                       0.029591
PREV_APP_CNT_PAYMENT_SUM                          0.029471
APARTMENTS_MODE                                   0.029419
FLAG_DOCUMENT_6                                   0.029226
MEAN_BUILDING_SCORE_AVG                           0.028854
INST_PAY_AMT_INSTAL_ACTUAL_DIFF_Binary_1_SUM      0.028850
Bureau_No_Record                                  0.028393
NEW_DOC_STD                                       0.027795
CC_CNT_DRAWINGS_ATM_CURRENT_SUM                   0.027760
INST_PAY_AMT_INSTAL_ACTUAL_DIFF_MEAN              0.027749
LIVINGAPARTMENTS_AVG                              0.027686
INST_PAY_INSTALMENT_ACTUAL_DAYS_DIFF_SUM          0.027649
AMT_CREDIT                                        0.027616
LIVINGAPARTMENTS_MEDI                             0.027013
PREV_APP_AMT_ANNUITY_MEAN                         0.026665
BUREAU_AVG_CREDIT_AMT                             0.026649
INST_PAY_AMT_INSTAL_ACTUAL_DIFF_SUM               0.026605
BASEMENTAREA_AVG                                  0.026551
LIVINGAPARTMENTS_MODE                             0.026105
BASEMENTAREA_MEDI                                 0.025989
YEARS_BUILD_AVG                                   0.025639
BUREAU_TOT_COUNT_MORTGAGE                         0.025550
YEARS_BUILD_MODE                                  0.025370
YEARS_BUILD_MEDI                                  0.025346
FLAG_WORK_PHONE                                   0.025204
INST_PAY_AMT_INSTAL_ACTUAL_DIFF_Binary_0_SUM      0.025073
HOUR_APPR_PROCESS_START                           0.024622
BASEMENTAREA_MODE                                 0.024470
PREV_APP_SK_ID_PREV_COUNT                         0.024356
INST_PAY_LATE_INSTAL_Binary_0_SUM                 0.024314
CC_AMT_DRAWINGS_ATM_CURRENT_SUM                   0.024270
CC_CNT_DRAWINGS_POS_CURRENT_SUM                   0.023548
FLAG_PHONE                                        0.023545
BUREAU_TOT_COUNT_CAR_LOAN                         0.023120
ENTRANCES_AVG                                     0.022788
CC_MONTHS_BALANCE_SUM                             0.022690
INCOME_TO_EMPLOYED_RATIO                          0.022634
BUREAU_INCOME_CREDIT_RATIO                        0.022583
PREV_APP_NAME_CONTRACT_STATUS_Canceled_SUM        0.022520
BUREAU_AVG_COUNT_MORTGAGE                         0.022457
ENTRANCES_MEDI                                    0.022381
PREV_APP_NAME_CONTRACT_STATUS_Approved_SUM        0.022107
COMMONAREA_AVG                                    0.021916
COMMONAREA_MEDI                                   0.021816
CC_AMT_PAYMENT_TOTAL_CURRENT_MAX                  0.021605
CC_AMT_PAYMENT_CURRENT_MEAN                       0.021581
BUREAU_AVG_COUNT_CAR_LOAN                         0.021501
BUREAU_TOT_COUNT_CREDIT_CARD_LOAN                 0.021262
INST_PAY_AMT_PAYMENT_SUM                          0.021188
PREV_APP_AMT_DOWN_PAYMENT_SUM                     0.021117
BUREAU_AVG_COUNT_CREDIT_CARD_LOAN                 0.021103
PREV_APP_AMT_DOWN_PAYMENT_MEAN                    0.020959
BUREAU_TOTAL_CREDIT_AMT                           0.020580
CC_AMT_PAYMENT_CURRENT_MAX                        0.020330
ENTRANCES_MODE                                    0.020075
CC_NAME_CONTRACT_STATUS_MIN                       0.020069
COMMONAREA_MODE                                   0.019529
BUREAU_CREDIT_DURATION_MEAN                       0.019354
CC_AMT_DRAWINGS_CURRENT_SUM                       0.019289
NEW_DOC_TOTAL                                     0.019016
NEW_DOC_AVG                                       0.019016
BUREAU_TOT_COUNT_CONSUMER_LOAN                    0.018974
CC_AMT_PAYMENT_TOTAL_CURRENT_MEAN                 0.018944
CHILDREN_RATIO                                    0.018912
INST_No_Record                                    0.018872
CC_CNT_INSTALMENT_MATURE_CUM_SUM                  0.018185
INST_PAY_AMT_PAYMENT_MEAN                         0.018130
BUREAU_LOAN_TYPES                                 0.017907
BUREAU_MONTHS_BALANCE_SIZE_SUM                    0.017764
PREV_APP_AMT_APPLICATION_MEAN                     0.017680
AMT_REQ_CREDIT_BUREAU_MON                         0.017647
INST_PAY_SK_ID_PREV_COUNT                         0.017642
BUREAU_ENDDATE_DIF_SUM                            0.017637
Prev_App_No_Record                                0.017542
BUREAU_STATUS_MAX                                 0.017475
BUREAU_CREDIT_DURATION_SUM                        0.017431
INST_PAY_INSTALMENT_ACTUAL_DAYS_DIFF_MEAN         0.016783
INST_PAY_AMT_INSTALMENT_SUM                       0.016727
GOODS_INCOME                                      0.016718
CC_CNT_DRAWINGS_OTHER_CURRENT_MEAN                0.016546
CNT_CHILDREN                                      0.016491
CC_MONTHS_BALANCE_MEAN                            0.016236
CC_AMT_BALANCE_SUM                                0.016099
CC_NAME_CONTRACT_STATUS_SUM                       0.016065
CC_AMT_RECEIVABLE_PRINCIPAL_SUM                   0.016057
CC_CNT_INSTALMENT_MATURE_CUM_MIN                  0.015983
CC_AMT_TOTAL_RECEIVABLE_SUM                       0.015957
CC_AMT_RECIVABLE_SUM                              0.015937
CC_NAME_CONTRACT_STATUS_MEAN                      0.015861
NONLIVINGAREA_AVG                                 0.015678
NONLIVINGAREA_MEDI                                0.015401
CC_AMT_PAYMENT_CURRENT_MIN                        0.015056
POS_No_Record                                     0.014950
CC_SK_ID_PREV_NUNIQUE                             0.014607
CC_No_Record                                      0.014442
CC_NAME_CONTRACT_STATUS_MAX                       0.014423
NONLIVINGAREA_MODE                                0.014149
Annuity_Income                                    0.014137
FLAG_DOCUMENT_16                                  0.013805
FLAG_DOCUMENT_13                                  0.013637
INST_PAY_AMT_INSTALMENT_MEAN                      0.013354
BUREAU_ENDDATE_DIF_MEAN                           0.012713
CC_AMT_DRAWINGS_OTHER_CURRENT_MEAN                0.012584
CC_AMT_DRAWINGS_POS_CURRENT_MAX                   0.012446
BUREAU_AVG_CREDIT_CARD_LIMIT                      0.012256
PREV_APP_AMT_CREDIT_MEAN                          0.012189
PAYMENTS                                          0.012183
CNT_NON_CHILD                                     0.012137
LANDAREA_MEDI                                     0.011543
PREV_APP_AMT_GOODS_PRICE_MEAN                     0.011535
CC_CNT_DRAWINGS_OTHER_CURRENT_MAX                 0.011469
LANDAREA_AVG                                      0.011446
INCOME_TO_BIRTH_RATIO                             0.011262
CC_AMT_DRAWINGS_OTHER_CURRENT_MAX                 0.010980
BUREAU_TOTAL_CREDIT_CARD_LIMIT                    0.010718
AMT_ANNUITY                                       0.010602
LANDAREA_MODE                                     0.010006
OBS_30_CNT_SOCIAL_CIRCLE                          0.009980
OBS_60_CNT_SOCIAL_CIRCLE                          0.009923
POS_REM_INSTALMENT                                0.009866
CC_AMT_DRAWINGS_POS_CURRENT_MEAN                  0.009772
PREV_APP_AMT_CREDIT_SUM                           0.009489
CC_AMT_CREDIT_LIMIT_ACTUAL_SUM                    0.009135
REG_REGION_NOT_WORK_REGION                        0.008952
FLAG_DOCUMENT_14                                  0.008802
CC_AMT_DRAWINGS_OTHER_CURRENT_SUM                 0.008732
PREV_APP_CNT_PAYMENT_MIN                          0.008623
CC_CNT_INSTALMENT_MATURE_CUM_MEAN                 0.008495
CC_AMT_INST_MIN_REGULARITY_SUM                    0.008205
FLAG_DOCUMENT_8                                   0.008075
BUREAU_LOAN_COUNT                                 0.007986
BUREAU_AVG_OVERDUE_DAYS                           0.007820
YEARS_BEGINEXPLUATATION_MEDI                      0.007582
YEARS_BEGINEXPLUATATION_AVG                       0.007271
FLAG_DOCUMENT_18                                  0.007236
CNT_FAM_MEMBERS                                   0.007178
AMT_REQ_CREDIT_BUREAU_QRT                         0.006880
Income_Cred                                       0.006846
CC_AMT_CREDIT_LIMIT_ACTUAL_MEAN                   0.006433
AMT_REQ_CREDIT_BUREAU_YEAR                        0.006393
LIVE_REGION_NOT_WORK_REGION                       0.006327
BUREAU_MONTHS_BALANCE_MAX                         0.006239
PREV_APP_AMT_GOODS_PRICE_SUM                      0.006065
PREV_APP_AMT_APPLICATION_SUM                      0.006011
BUREAU_AVG_CURRENT_AMT_DUE                        0.005980
BUREAU_AVG_CREDIT_OVERDUE                         0.005957
FLAG_DOCUMENT_9                                   0.005893
YEARS_BEGINEXPLUATATION_MODE                      0.005870
FLAG_DOCUMENT_15                                  0.005507
CC_AMT_DRAWINGS_ATM_CURRENT_MIN                   0.005338
CC_AMT_DRAWINGS_CURRENT_MIN                       0.005308
REG_REGION_NOT_LIVE_REGION                        0.005204
FLAG_DOCUMENT_21                                  0.004937
INCOME_PER_CHILD                                  0.004770
CC_AMT_PAYMENT_CURRENT_SUM                        0.004579
CC_AMT_PAYMENT_TOTAL_CURRENT_SUM                  0.004453
PREV_APP_NAME_CONTRACT_STATUS_Unused offer_SUM    0.004414
PREV_APP_AMT_ANNUITY_SUM                          0.004350
CC_SK_DPD_DEF_MAX                                 0.004220
CC_AMT_INST_MIN_REGULARITY_MIN                    0.004133
BUREAU_AVG_CREDIT_DEBT                            0.003981
BUREAU_AVG_ANNUITY                                0.003961
FLAG_DOCUMENT_11                                  0.003945
CC_SK_DPD_DEF_MEAN                                0.003860
CC_SK_DPD_DEF_SUM                                 0.003806
SK_ID_CURR                                        0.003788
CC_AMT_DRAWINGS_POS_CURRENT_SUM                   0.003709
BUREAU_MAX_ANNUITY                                0.003696
AMT_REQ_CREDIT_BUREAU_TOTAL                       0.002994
FLAG_CONT_MOBILE                                  0.002673
BUREAU_TOTAL_PROLONGED_CREDIT                     0.002641
NEW_LIVE_STD                                      0.002628
CC_CNT_INSTALMENT_MATURE_CUM_MAX                  0.002623
CC_SK_DPD_MAX                                     0.002558
CC_TOTAL_INSTALMENTS                              0.002553
AMT_REQ_CREDIT_BUREAU_DAY                         0.002532
FLAG_DOCUMENT_4                                   0.002507
FLAG_DOCUMENT_5                                   0.002268
CC_AMT_DRAWINGS_POS_CURRENT_MIN                   0.001861
OWN_CAR_AGE                                       0.001816
AMT_INCOME_TOTAL                                  0.001784
BUREAU_TOTAL_CREDIT_DEBT                          0.001733
FLAG_DOCUMENT_2                                   0.001539
FLAG_DOCUMENT_19                                  0.001536
FLAG_DOCUMENT_17                                  0.001414
NONLIVINGAPARTMENTS_AVG                           0.001350
CC_AMT_PAYMENT_TOTAL_CURRENT_MIN                  0.001329
FLAG_DOCUMENT_10                                  0.001309
NONLIVINGAPARTMENTS_MEDI                          0.001305
FLAG_DOCUMENT_20                                  0.001246
FLAG_EMAIL                                        0.001244
AMT_REQ_CREDIT_BUREAU_HOUR                        0.001227
NEW_LIVE_SUM                                      0.001137
CC_SK_DPD_SUM                                     0.001113
CC_SK_DPD_MEAN                                    0.001037
BUREAU_DEBT_PERCENTAGE_MEAN                       0.000956
NEW_LIVE_KURT                                     0.000864
POS_ACT_CONTRACTS                                 0.000826
CC_AMT_DRAWINGS_OTHER_CURRENT_MIN                 0.000802
AMT_REQ_CREDIT_BUREAU_WEEK                        0.000775
FLAG_MOBIL                                        0.000756
FLAG_DOCUMENT_7                                   0.000721
NONLIVINGAPARTMENTS_MODE                          0.000528
Income_PP                                         0.000391
FLAG_DOCUMENT_12                                       NaN
BUREAU_AVG_COUNT_NAN_LOAN                              NaN
BUREAU_TOT_COUNT_NAN_LOAN                              NaN
Name: TARGET, dtype: float64
In [13]:
#Lets try to look at this visually on heatmap
plt.subplots(figsize = (15,15))
sns.heatmap(FEA_N.corr().abs(), cmap = 'viridis') 

#based on the correlation between features we would go ahead and reduce dimensionality using high r value (greater than 0.9)
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f9d9452ee80>

PipeLine Step 3 : Class for Removing unusual/Strange data

While creating new features, we created few ratios that would add meaningful and significant features to the overall dataset. During creation of ratios we noticed some unusual and infinite values that needed to be removed.This is basically due to few missing values in the orginal dataset contributing to the infinite values.

Below function address the aforementioned problem

In [14]:
class Remove_abnormal(TransformerMixin, BaseEstimator):
    '''A template for a custom transformer.'''

    def __init__(self):
        self.columns_to_drop = []
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # transform X via code or additional methods
#         X = X.copy()
        
        X.replace(np.inf, np.nan, inplace = True)
        X.replace(-np.inf, np.nan, inplace = True)
        
        print("Abnormality removed............")
        return X

PipeLine Step 4 : Dimensionality Reduction

We use three methods to reduce the number of features in the dataset

  • With Features that have more 60% of data missing, we have dropped the feature
  • Feature which have a low standard deviation, that are less than 0.01 are ignored
  • Drop features that are highly correlated - Where the correlation r is greater than 90%
  • We also droped unique column SK_ID_CURR as we retain the unique column from the App train test datset
In [4]:
class DimReduction(TransformerMixin, BaseEstimator):
    '''A template for a custom transformer.'''

    def __init__(self):
        self.columns_to_drop = []
        pass

    def fit(self, X, y=None):
        # Find feature where the missing values are more than 60%
        missing_columns = X.columns[X.isna().mean() > 0.6].to_list()
        # Find feature where variance is very low
        columns_with_small_variance = list(X.std()[(X.std() < 0.01)].index)
        # Remove Highly correlated features
        corr_matrix = X.corr().abs()
        mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
        tri_df = corr_matrix.mask(mask)
        high_correlated_columns = [c for c in tri_df.columns
                                   if any(tri_df[c] >  0.9)]
        
        self.columns_to_drop = list(set(missing_columns +
                                        columns_with_small_variance +
                                        high_correlated_columns)) 
        
        self.columns_to_drop.append('SK_ID_CURR')
        return self

    def transform(self, X):
        # transform X via code or additional methods
        X = X.drop(self.columns_to_drop, axis=1)
        
        print("Dimensionality reduced............")
        return X

PipeLine Step 5 : Custom Imputer

We further clean up and prepare the overall dataset by following transformation:

  • Catergorical Varaible - We use imputing using the most frequent values
  • Discrete Numerical Variable - We use imputing using the most frequent values. Discrete Numerical variale defined where Unique Numerical value is less than 25.
  • Continuous Numerical Variable - We use median to replace missing values
In [16]:
class CustomImputer(TransformerMixin, BaseEstimator):
    '''A template for a custom transformer.'''

    def __init__(self):
        
        self.cat = SimpleImputer(strategy='most_frequent')
        self.disnum = SimpleImputer(strategy='most_frequent')
        self.num = SimpleImputer(strategy='median')
        
#         self.ohe = OneHotEncoder(handle_unknown='ignore'))

        
        ## Column Lists
        self.cat_list = []
        self.dis_num_list = []
        self.num_list = []
  

    def fit(self, X, y=None):
        self.cat_list, self.dis_num_list, self.num_list = self._feature_type_split(X)
        
        self.cat.fit(X[self.cat_list])
        self.disnum.fit(X[self.dis_num_list])
        self.num.fit(X[self.num_list])
        
        return self

    def transform(self, X):
        # transform X via code or additional methods
#         X = X.copy()
        
        # Categorical List
        X[self.cat_list] = self.cat.transform(X[self.cat_list])
    
        # Distinct Numerical List
        X[self.dis_num_list] = self.disnum.transform(X[self.dis_num_list])

        # Continous Numerical List
        X[self.num_list] = self.num.transform(X[self.num_list])
        
        print("Imputing completed............")
        return X
    
    def _feature_type_split(self, X):
        cat_list = []
        dis_num_list = []
        num_list = []
        df = X.copy()
        # df= df.drop('SK_ID_CURR', axis =1)
        for i in df.columns.tolist():
            if i != 'TARGET':
                if df[i].dtype == 'object':
                    cat_list.append(i)
                elif df[i].nunique() < 25:
                    dis_num_list.append(i)
                else:
                    num_list.append(i)
        print('Columns split completed............')
        return cat_list, dis_num_list, num_list

PipeLine Step 6 : Binary Encoding

For categorical features we also performed One-Hot encoding to distinguish between features and use it effectively for our model building.

Below we a class to perform the Binary Hot encoding on the dataset

In [17]:
class Binary_Labelizer(TransformerMixin, BaseEstimator):
    '''A template for a custom transformer.'''

    def __init__(self):
        self.columns_to_drop = []
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X):
       
        X = pd.get_dummies(X, drop_first=True)

        
        print("One Hot Encoding Performed............")
        return X

Bring it Altogether

Now that all the above class defined, we will put it inside pipeline and test it using Pipeline Test above class using the below pipeline

In [18]:
data_pipeline = Pipeline([('merger', Supp_Info_Adder()),
                          ('feature_adder', Feature_Adder()),
                          ('abnormal_data', Remove_abnormal()),
                          ('dimensionality_reduction', DimReduction()),
                          ('custom_imputer', CustomImputer()),
                          ('ohe', Binary_Labelizer()),
                         ])

Perform Fit and Transform on Train data

We take the stratified train data and fit the values. This is 50% of the data, which would be further used for HyperParamter tunning in Part 3

In [19]:
result_train = data_pipeline.fit_transform(df_sample)
1it [00:01,  1.91s/it]
0it [00:00, ?it/s]
bureau_processed.csv ... Read Completed
****************************************
1it [00:00,  3.09it/s]
0it [00:00, ?it/s]
pos_processed.csv ... Read Completed
****************************************
1it [00:01,  1.64s/it]
0it [00:00, ?it/s]
cc_processed.csv ... Read Completed
****************************************
1it [00:01,  1.02s/it]
0it [00:00, ?it/s]
previous_application_processed.csv ... Read Completed
****************************************
1it [00:00,  1.27it/s]
installments_payments_processed.csv ... Read Completed
****************************************
supplemetery files merged to main train/test file............
Feature Adder commpleted............
Abnormality removed............
Dimensionality reduced............
Columns split completed............
Imputing completed............
One Hot Encoding Performed............
In [21]:
# Describe
print("Shape of final train file is : " , result_train.shape)
result_train.head()
Shape of final train file is :  (153755, 311)
Out[21]:
CNT_CHILDREN AMT_ANNUITY AMT_GOODS_PRICE REGION_POPULATION_RELATIVE DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH OWN_CAR_AGE FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL CNT_FAM_MEMBERS REGION_RATING_CLIENT_W_CITY HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI LANDAREA_MEDI NONLIVINGAREA_MEDI TOTALAREA_MODE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_3 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_11 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR TARGET BUREAU_LOAN_TYPES BUREAU_ACTIVE_LOANS_PERCENTAGE BUREAU_CREDIT_ENDDATE_PERCENTAGE BUREAU_VARIANCE_DAYS_FROM_LAST_APPLICATION BUREAU_DAYS_CREDIT_UPDATE_MEAN BUREAU_AVG_OVERDUE_DAYS BUREAU_AVG_CREDIT_OVERDUE BUREAU_AVG_CREDIT_AMT BUREAU_TOTAL_CREDIT_AMT BUREAU_AVG_CREDIT_DEBT BUREAU_TOTAL_CREDIT_DEBT BUREAU_AVG_CURRENT_AMT_DUE BUREAU_AVG_CREDIT_CARD_LIMIT BUREAU_TOTAL_CREDIT_CARD_LIMIT BUREAU_MAX_ANNUITY BUREAU_AVG_ANNUITY BUREAU_TOTAL_PROLONGED_CREDIT BUREAU_MONTHS_BALANCE_MIN BUREAU_MONTHS_BALANCE_MAX BUREAU_MONTHS_BALANCE_SIZE_MEAN BUREAU_MONTHS_BALANCE_SIZE_SUM BUREAU_STATUS_MIN BUREAU_STATUS_MAX BUREAU_STATUS_MEAN BUREAU_DEBT_PERCENTAGE_MEAN BUREAU_CREDIT_DURATION_MEAN BUREAU_CREDIT_DURATION_SUM BUREAU_ENDDATE_DIF_MEAN BUREAU_ENDDATE_DIF_SUM BUREAU_AVG_COUNT_CAR_LOAN BUREAU_TOT_COUNT_CAR_LOAN BUREAU_AVG_COUNT_CONSUMER_LOAN BUREAU_TOT_COUNT_CONSUMER_LOAN BUREAU_AVG_COUNT_CREDIT_CARD_LOAN BUREAU_TOT_COUNT_CREDIT_CARD_LOAN BUREAU_AVG_COUNT_MORTGAGE BUREAU_TOT_COUNT_MORTGAGE BUREAU_AVG_COUNT_RARE_LOAN BUREAU_TOT_COUNT_RARE_LOAN POS_ACT_CONTRACTS POS_Fut_Install_Mon_Ratio POS_REM_INSTALMENT POS_PAID_LATE POS_PAID_LATE_TOL CC_AMT_CREDIT_LIMIT_ACTUAL_SUM CC_AMT_CREDIT_LIMIT_ACTUAL_MEAN CC_AMT_DRAWINGS_ATM_CURRENT_SUM CC_AMT_DRAWINGS_ATM_CURRENT_MEAN CC_AMT_DRAWINGS_ATM_CURRENT_MIN CC_AMT_DRAWINGS_ATM_CURRENT_MAX CC_AMT_DRAWINGS_CURRENT_MIN CC_AMT_DRAWINGS_CURRENT_MAX CC_AMT_DRAWINGS_OTHER_CURRENT_MEAN CC_AMT_DRAWINGS_OTHER_CURRENT_MIN CC_AMT_DRAWINGS_OTHER_CURRENT_MAX CC_AMT_DRAWINGS_POS_CURRENT_SUM CC_AMT_DRAWINGS_POS_CURRENT_MEAN CC_AMT_DRAWINGS_POS_CURRENT_MIN CC_AMT_DRAWINGS_POS_CURRENT_MAX CC_AMT_INST_MIN_REGULARITY_MIN CC_AMT_PAYMENT_CURRENT_MIN CC_AMT_PAYMENT_TOTAL_CURRENT_SUM CC_AMT_PAYMENT_TOTAL_CURRENT_MEAN CC_AMT_PAYMENT_TOTAL_CURRENT_MIN CC_AMT_PAYMENT_TOTAL_CURRENT_MAX CC_AMT_TOTAL_RECEIVABLE_SUM CC_AMT_TOTAL_RECEIVABLE_MIN CC_AMT_TOTAL_RECEIVABLE_MAX CC_CNT_DRAWINGS_ATM_CURRENT_SUM CC_CNT_DRAWINGS_ATM_CURRENT_MEAN CC_CNT_DRAWINGS_OTHER_CURRENT_MEAN CC_CNT_DRAWINGS_OTHER_CURRENT_MAX CC_CNT_DRAWINGS_POS_CURRENT_SUM CC_CNT_DRAWINGS_POS_CURRENT_MEAN CC_CNT_DRAWINGS_POS_CURRENT_MAX CC_CNT_INSTALMENT_MATURE_CUM_MIN CC_SK_DPD_MAX CC_SK_DPD_DEF_MAX CC_NAME_CONTRACT_STATUS_SUM CC_NAME_CONTRACT_STATUS_MIN CC_TOTAL_INSTALMENTS CC_PERCENTAGE_MISSED_PAYMENTS CC_CREDIT_LOAD_y PREV_APP_SK_ID_PREV_COUNT PREV_APP_NAME_CONTRACT_TYPE_NUNIQUE PREV_APP_AMT_ANNUITY_MEAN PREV_APP_AMT_DOWN_PAYMENT_MEAN PREV_APP_AMT_DOWN_PAYMENT_SUM PREV_APP_AMT_GOODS_PRICE_MEAN PREV_APP_AMT_GOODS_PRICE_SUM PREV_APP_CNT_PAYMENT_MEAN PREV_APP_CNT_PAYMENT_SUM PREV_APP_CNT_PAYMENT_MIN PREV_APP_CNT_PAYMENT_MAX PREV_APP_NAME_CONTRACT_STATUS_Approved_SUM PREV_APP_NAME_CONTRACT_STATUS_Canceled_SUM PREV_APP_NAME_CONTRACT_STATUS_Refused_SUM PREV_APP_NAME_CONTRACT_STATUS_Unused offer_SUM PREV_APP_NAME_CONTRACT_STATUS_REFUSED_RATIO_ INST_PAY_AMT_PAYMENT_MEAN INST_PAY_AMT_PAYMENT_SUM INST_PAY_INSTALMENT_ACTUAL_DAYS_DIFF_MEAN INST_PAY_INSTALMENT_ACTUAL_DAYS_DIFF_SUM INST_PAY_AMT_INSTAL_ACTUAL_DIFF_MEAN INST_PAY_AMT_INSTAL_ACTUAL_DIFF_SUM INST_PAY_LATE_INSTAL_Binary_1_SUM INST_PAY_AMT_INSTAL_ACTUAL_DIFF_Binary_0_SUM INST_PAY_AMT_INSTAL_ACTUAL_DIFF_Binary_1_SUM INST_PAY_LATE_INSTAL_RATIO_ INST_PAY_AMT_PARTIAL_INSTAL_RATIO_ Annuity_Income NEW_CREDIT_TO_GOODS_RATIO GOODS_INCOME LOAN_ACESS INCOME_TO_EMPLOYED_RATIO INCOME_TO_BIRTH_RATIO CNT_NON_CHILD TERM MEAN_BUILDING_SCORE_AVG TOTAL_BUILDING_SCORE_AVG NEW_DOC_STD NEW_DOC_KURT NEW_LIVE_SUM NEW_LIVE_STD NEW_LIVE_KURT AMT_REQ_CREDIT_BUREAU_TOTAL AGE_RANGE NEW_INC_BY_ORG NEW_EXT_SOURCES_MEAN Bureau_No_Record BUREAU_INCOME_CREDIT_RATIO CC_No_Record POS_No_Record NAME_CONTRACT_TYPE_Revolving loans CODE_GENDER_M FLAG_OWN_CAR_Y FLAG_OWN_REALTY_Y NAME_TYPE_SUITE_Family NAME_TYPE_SUITE_Group of people NAME_TYPE_SUITE_Other_A NAME_TYPE_SUITE_Other_B NAME_TYPE_SUITE_Spouse, partner NAME_TYPE_SUITE_Unaccompanied NAME_INCOME_TYPE_Commercial associate NAME_INCOME_TYPE_Pensioner NAME_INCOME_TYPE_State servant NAME_INCOME_TYPE_Student NAME_INCOME_TYPE_Unemployed NAME_INCOME_TYPE_Working NAME_EDUCATION_TYPE_Higher education NAME_EDUCATION_TYPE_Incomplete higher NAME_EDUCATION_TYPE_Lower secondary NAME_EDUCATION_TYPE_Secondary / secondary special NAME_FAMILY_STATUS_Married NAME_FAMILY_STATUS_Separated NAME_FAMILY_STATUS_Single / not married NAME_FAMILY_STATUS_Widow NAME_HOUSING_TYPE_House / apartment NAME_HOUSING_TYPE_Municipal apartment NAME_HOUSING_TYPE_Office apartment NAME_HOUSING_TYPE_Rented apartment NAME_HOUSING_TYPE_With parents OCCUPATION_TYPE_Cleaning staff OCCUPATION_TYPE_Cooking staff OCCUPATION_TYPE_Core staff OCCUPATION_TYPE_Drivers OCCUPATION_TYPE_HR staff OCCUPATION_TYPE_High skill tech staff OCCUPATION_TYPE_IT staff OCCUPATION_TYPE_Laborers OCCUPATION_TYPE_Low-skill Laborers OCCUPATION_TYPE_Managers OCCUPATION_TYPE_Medicine staff OCCUPATION_TYPE_Private service staff OCCUPATION_TYPE_Realty agents OCCUPATION_TYPE_Sales staff OCCUPATION_TYPE_Secretaries OCCUPATION_TYPE_Security staff OCCUPATION_TYPE_Waiters/barmen staff WEEKDAY_APPR_PROCESS_START_MONDAY WEEKDAY_APPR_PROCESS_START_SATURDAY WEEKDAY_APPR_PROCESS_START_SUNDAY WEEKDAY_APPR_PROCESS_START_THURSDAY WEEKDAY_APPR_PROCESS_START_TUESDAY WEEKDAY_APPR_PROCESS_START_WEDNESDAY ORGANIZATION_TYPE_Agriculture ORGANIZATION_TYPE_Bank ORGANIZATION_TYPE_Business Entity Type 1 ORGANIZATION_TYPE_Business Entity Type 2 ORGANIZATION_TYPE_Business Entity Type 3 ORGANIZATION_TYPE_Cleaning ORGANIZATION_TYPE_Construction ORGANIZATION_TYPE_Culture ORGANIZATION_TYPE_Electricity ORGANIZATION_TYPE_Emergency ORGANIZATION_TYPE_Government ORGANIZATION_TYPE_Hotel ORGANIZATION_TYPE_Housing ORGANIZATION_TYPE_Industry: type 1 ORGANIZATION_TYPE_Industry: type 10 ORGANIZATION_TYPE_Industry: type 11 ORGANIZATION_TYPE_Industry: type 12 ORGANIZATION_TYPE_Industry: type 13 ORGANIZATION_TYPE_Industry: type 2 ORGANIZATION_TYPE_Industry: type 3 ORGANIZATION_TYPE_Industry: type 4 ORGANIZATION_TYPE_Industry: type 5 ORGANIZATION_TYPE_Industry: type 6 ORGANIZATION_TYPE_Industry: type 7 ORGANIZATION_TYPE_Industry: type 8 ORGANIZATION_TYPE_Industry: type 9 ORGANIZATION_TYPE_Insurance ORGANIZATION_TYPE_Kindergarten ORGANIZATION_TYPE_Legal Services ORGANIZATION_TYPE_Medicine ORGANIZATION_TYPE_Military ORGANIZATION_TYPE_Mobile ORGANIZATION_TYPE_Other ORGANIZATION_TYPE_Police ORGANIZATION_TYPE_Postal ORGANIZATION_TYPE_Realtor ORGANIZATION_TYPE_Religion ORGANIZATION_TYPE_Restaurant ORGANIZATION_TYPE_School ORGANIZATION_TYPE_Security ORGANIZATION_TYPE_Security Ministries ORGANIZATION_TYPE_Self-employed ORGANIZATION_TYPE_Services ORGANIZATION_TYPE_Telecom ORGANIZATION_TYPE_Trade: type 1 ORGANIZATION_TYPE_Trade: type 2 ORGANIZATION_TYPE_Trade: type 3 ORGANIZATION_TYPE_Trade: type 4 ORGANIZATION_TYPE_Trade: type 5 ORGANIZATION_TYPE_Trade: type 6 ORGANIZATION_TYPE_Trade: type 7 ORGANIZATION_TYPE_Transport: type 1 ORGANIZATION_TYPE_Transport: type 2 ORGANIZATION_TYPE_Transport: type 3 ORGANIZATION_TYPE_Transport: type 4 ORGANIZATION_TYPE_University ORGANIZATION_TYPE_XNA HOUSETYPE_MODE_specific housing HOUSETYPE_MODE_terraced house WALLSMATERIAL_MODE_Mixed WALLSMATERIAL_MODE_Monolithic WALLSMATERIAL_MODE_Others WALLSMATERIAL_MODE_Panel WALLSMATERIAL_MODE_Stone, brick WALLSMATERIAL_MODE_Wooden EMERGENCYSTATE_MODE_Yes
0 0.0 27652.5 450000.0 0.011703 -1063.0 -185.0 -3079.0 10.0 1.0 0.0 1.0 0.0 0.0 2.0 2.0 12.0 0.0 0.0 0.0 0.0 0.0 0.0 0.221539 0.617474 0.315472 0.0000 0.9717 0.00 0.0690 0.0000 0.0000 0.000 0.0017 0.0 0.0 0.0 -628.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0 2.0 0.714286 0.142857 26486.142857 -45.857143 0.0 0.0 2.670858e+05 1869600.645 5050.285714 35352.0 0.0 78750.0 393750.0 39375.0 7207.3125 0.0 -17.0 0.0 9.142857 64.0 -1.0 0.0 -0.166667 0.100412 6956.000000 34780.0 961.000000 1922.0 0.0 0.0 0.428571 3.0 0.571429 4.0 0.0 0.0 0.0 0.0 1.0 -0.505618 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.0 1.0 5882.08500 8943.75 17887.5 59107.5 118215.0 12.0 24.0 12.0 12.0 2.0 0.0 0.0 0.0 0.000000 27611.190000 110444.760 -21.500000 -86.0 0.000000 0.00 0.0 4.0 0.0 0.000000 0.000000 0.204833 1.198000 3.333333 89100.0 -126.999059 -12.998267 2.0 19.495525 0.121479 1.7007 0.223607 20.0 3.0 0.547723 -3.333333 1.0 2.0 157500.0 0.384829 0.0 1.978413 1.0 0.0 0 1 1 1 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0
1 0.0 29511.0 328500.0 0.003122 -1458.0 -242.0 -2413.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0 3.0 10.0 0.0 0.0 0.0 0.0 0.0 0.0 0.505143 0.618197 0.537070 0.0503 0.9816 0.00 0.1379 0.1667 0.0490 0.000 0.0467 0.0 1.0 0.0 -739.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1 0.0 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.000000e+00 0.000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0000 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.000000 0.000000 0.000000 0.0 0.000000 0.0 0.0 0.0 0.000000 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 1.0 -1.736842 0.0 0.052632 0.052632 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 24769.53000 2947.50 0.0 315000.0 315000.0 36.0 36.0 36.0 36.0 1.0 0.0 0.0 0.0 0.000000 28760.293800 719007.345 -3.840000 -96.0 3963.124800 99078.12 10.0 17.0 8.0 0.400000 0.320000 0.262320 1.132000 2.920000 43362.0 -77.160494 -5.964373 1.0 12.600793 0.179686 2.5156 0.223607 20.0 3.0 0.547723 -3.333333 0.0 4.0 135000.0 0.618197 1.0 0.000000 1.0 0.0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
2 0.0 22986.0 283500.0 0.010500 -1653.0 -9788.0 -4157.0 19.0 0.0 0.0 1.0 0.0 0.0 2.0 3.0 9.0 0.0 0.0 0.0 0.0 0.0 0.0 0.505143 0.219295 0.673830 0.0759 0.9816 0.00 0.1379 0.1667 0.0487 0.003 0.0688 0.0 2.0 0.0 -1000.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 2.0 0 2.0 0.400000 0.600000 12824.000000 -435.400000 0.0 0.0 3.155391e+04 157769.550 0.000000 0.0 0.0 0.0 0.0 0.0 0.0000 0.0 -35.0 0.0 15.400000 0.0 -1.0 0.0 -0.277990 0.000000 779.200000 3896.0 18.666667 56.0 0.0 0.0 0.600000 3.0 0.400000 2.0 0.0 0.0 0.0 0.0 1.0 -0.182755 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 5.0 2.0 13948.76700 8634.00 25902.0 96999.3 484996.5 8.0 40.0 6.0 12.0 5.0 0.0 0.0 0.0 0.000000 14040.654868 533544.885 -21.210526 -806.0 0.000000 0.00 0.0 38.0 0.0 0.000000 0.000000 0.255400 1.158397 3.150000 44905.5 -96.926805 -3.967204 2.0 14.287197 0.207375 0.0000 0.223607 20.0 3.0 0.516398 -1.875000 3.0 4.0 117000.0 0.446563 0.0 0.350599 1.0 0.0 0 0 1 1 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0
3 2.0 34173.0 450000.0 0.018209 -5051.0 -4311.0 -3675.0 15.0 1.0 1.0 1.0 0.0 0.0 4.0 3.0 21.0 0.0 0.0 0.0 0.0 0.0 0.0 0.394391 0.400263 0.726711 0.1192 0.9811 0.24 0.2069 0.3333 0.1176 0.000 0.1924 1.0 5.0 0.0 -1515.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0 2.0 0.000000 1.000000 409352.250000 -792.250000 0.0 0.0 2.758369e+05 1103347.575 0.000000 0.0 0.0 0.0 0.0 38812.5 12937.5000 0.0 -69.0 0.0 42.750000 171.0 -1.0 0.0 -0.811189 0.000000 436.250000 1745.0 7.500000 30.0 0.0 0.0 0.750000 3.0 0.250000 1.0 0.0 0.0 0.0 0.0 1.0 -0.066599 0.0 0.117647 0.117647 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.0 1.0 5757.81750 7647.75 15295.5 42457.5 84915.0 7.0 14.0 4.0 10.0 2.0 0.0 0.0 0.0 0.000000 4328.507250 86570.145 -8.000000 -160.0 1925.239500 38504.79 7.0 8.0 12.0 0.350000 0.600000 0.379700 1.132000 5.000000 59400.0 -17.818254 -7.506255 2.0 14.906505 0.269136 3.7679 0.223607 20.0 3.0 0.516398 -1.875000 2.0 2.0 157500.0 0.507122 0.0 3.064854 1.0 0.0 0 0 1 1 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
4 0.0 48208.5 1305000.0 0.001276 -1817.0 -8499.0 -4347.0 0.0 1.0 1.0 1.0 1.0 0.0 2.0 2.0 15.0 0.0 0.0 0.0 0.0 0.0 0.0 0.505143 0.582560 0.729567 0.0934 0.9876 0.00 0.2069 0.1667 0.0594 0.000 0.0265 0.0 0.0 0.0 -1250.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.0 0 1.0 0.000000 0.750000 850249.583333 -995.250000 0.0 0.0 1.238625e+06 4954500.000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0000 0.0 -35.0 0.0 15.400000 0.0 -1.0 0.0 -0.277990 0.000000 1340.666667 4022.0 234.666667 704.0 0.0 0.0 1.000000 4.0 0.000000 0.0 0.0 0.0 0.0 0.0 1.0 -0.475908 0.0 0.015873 0.015873 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 4.0 2.0 23803.70625 0.00 0.0 308223.0 1232892.0 24.0 96.0 18.0 36.0 3.0 0.0 1.0 0.0 0.333333 16761.253269 1307377.755 -3.448718 -269.0 2685.403846 209461.50 26.0 48.0 30.0 0.333333 0.384615 0.133912 1.145200 3.625000 189486.0 -198.128784 -16.662038 2.0 31.000467 0.193679 2.7115 0.223607 20.0 3.0 0.408248 6.000000 2.0 4.0 157500.0 0.656063 0.0 3.440625 1.0 0.0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0

Perform Transform on Kaggle Test dataset

Here we peform the transform on the test dataset based on the paramters that we gained using the fit funciton on train dataset

In [22]:
result_test = data_pipeline.transform(app_test)
1it [00:02,  2.00s/it]
0it [00:00, ?it/s]
bureau_processed.csv ... Read Completed
****************************************
1it [00:00,  2.87it/s]
0it [00:00, ?it/s]
pos_processed.csv ... Read Completed
****************************************
1it [00:01,  1.69s/it]
0it [00:00, ?it/s]
cc_processed.csv ... Read Completed
****************************************
1it [00:01,  1.06s/it]
0it [00:00, ?it/s]
previous_application_processed.csv ... Read Completed
****************************************
1it [00:00,  1.24it/s]
installments_payments_processed.csv ... Read Completed
****************************************
supplemetery files merged to main train/test file............
Feature Adder commpleted............
Abnormality removed............
Dimensionality reduced............
Imputing completed............
One Hot Encoding Performed............
In [23]:
print("Shape of final test file is : " , result_test.shape)
result_test.head()
Shape of final test file is :  (48744, 310)
Out[23]:
CNT_CHILDREN AMT_ANNUITY AMT_GOODS_PRICE REGION_POPULATION_RELATIVE DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH OWN_CAR_AGE FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL CNT_FAM_MEMBERS REGION_RATING_CLIENT_W_CITY HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI LANDAREA_MEDI NONLIVINGAREA_MEDI TOTALAREA_MODE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_3 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_11 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR BUREAU_LOAN_TYPES BUREAU_ACTIVE_LOANS_PERCENTAGE BUREAU_CREDIT_ENDDATE_PERCENTAGE BUREAU_VARIANCE_DAYS_FROM_LAST_APPLICATION BUREAU_DAYS_CREDIT_UPDATE_MEAN BUREAU_AVG_OVERDUE_DAYS BUREAU_AVG_CREDIT_OVERDUE BUREAU_AVG_CREDIT_AMT BUREAU_TOTAL_CREDIT_AMT BUREAU_AVG_CREDIT_DEBT BUREAU_TOTAL_CREDIT_DEBT BUREAU_AVG_CURRENT_AMT_DUE BUREAU_AVG_CREDIT_CARD_LIMIT BUREAU_TOTAL_CREDIT_CARD_LIMIT BUREAU_MAX_ANNUITY BUREAU_AVG_ANNUITY BUREAU_TOTAL_PROLONGED_CREDIT BUREAU_MONTHS_BALANCE_MIN BUREAU_MONTHS_BALANCE_MAX BUREAU_MONTHS_BALANCE_SIZE_MEAN BUREAU_MONTHS_BALANCE_SIZE_SUM BUREAU_STATUS_MIN BUREAU_STATUS_MAX BUREAU_STATUS_MEAN BUREAU_DEBT_PERCENTAGE_MEAN BUREAU_CREDIT_DURATION_MEAN BUREAU_CREDIT_DURATION_SUM BUREAU_ENDDATE_DIF_MEAN BUREAU_ENDDATE_DIF_SUM BUREAU_AVG_COUNT_CAR_LOAN BUREAU_TOT_COUNT_CAR_LOAN BUREAU_AVG_COUNT_CONSUMER_LOAN BUREAU_TOT_COUNT_CONSUMER_LOAN BUREAU_AVG_COUNT_CREDIT_CARD_LOAN BUREAU_TOT_COUNT_CREDIT_CARD_LOAN BUREAU_AVG_COUNT_MORTGAGE BUREAU_TOT_COUNT_MORTGAGE BUREAU_AVG_COUNT_RARE_LOAN BUREAU_TOT_COUNT_RARE_LOAN POS_ACT_CONTRACTS POS_Fut_Install_Mon_Ratio POS_REM_INSTALMENT POS_PAID_LATE POS_PAID_LATE_TOL CC_AMT_CREDIT_LIMIT_ACTUAL_SUM CC_AMT_CREDIT_LIMIT_ACTUAL_MEAN CC_AMT_DRAWINGS_ATM_CURRENT_SUM CC_AMT_DRAWINGS_ATM_CURRENT_MEAN CC_AMT_DRAWINGS_ATM_CURRENT_MIN CC_AMT_DRAWINGS_ATM_CURRENT_MAX CC_AMT_DRAWINGS_CURRENT_MIN CC_AMT_DRAWINGS_CURRENT_MAX CC_AMT_DRAWINGS_OTHER_CURRENT_MEAN CC_AMT_DRAWINGS_OTHER_CURRENT_MIN CC_AMT_DRAWINGS_OTHER_CURRENT_MAX CC_AMT_DRAWINGS_POS_CURRENT_SUM CC_AMT_DRAWINGS_POS_CURRENT_MEAN CC_AMT_DRAWINGS_POS_CURRENT_MIN CC_AMT_DRAWINGS_POS_CURRENT_MAX CC_AMT_INST_MIN_REGULARITY_MIN CC_AMT_PAYMENT_CURRENT_MIN CC_AMT_PAYMENT_TOTAL_CURRENT_SUM CC_AMT_PAYMENT_TOTAL_CURRENT_MEAN CC_AMT_PAYMENT_TOTAL_CURRENT_MIN CC_AMT_PAYMENT_TOTAL_CURRENT_MAX CC_AMT_TOTAL_RECEIVABLE_SUM CC_AMT_TOTAL_RECEIVABLE_MIN CC_AMT_TOTAL_RECEIVABLE_MAX CC_CNT_DRAWINGS_ATM_CURRENT_SUM CC_CNT_DRAWINGS_ATM_CURRENT_MEAN CC_CNT_DRAWINGS_OTHER_CURRENT_MEAN CC_CNT_DRAWINGS_OTHER_CURRENT_MAX CC_CNT_DRAWINGS_POS_CURRENT_SUM CC_CNT_DRAWINGS_POS_CURRENT_MEAN CC_CNT_DRAWINGS_POS_CURRENT_MAX CC_CNT_INSTALMENT_MATURE_CUM_MIN CC_SK_DPD_MAX CC_SK_DPD_DEF_MAX CC_NAME_CONTRACT_STATUS_SUM CC_NAME_CONTRACT_STATUS_MIN CC_TOTAL_INSTALMENTS CC_PERCENTAGE_MISSED_PAYMENTS CC_CREDIT_LOAD_y PREV_APP_SK_ID_PREV_COUNT PREV_APP_NAME_CONTRACT_TYPE_NUNIQUE PREV_APP_AMT_ANNUITY_MEAN PREV_APP_AMT_DOWN_PAYMENT_MEAN PREV_APP_AMT_DOWN_PAYMENT_SUM PREV_APP_AMT_GOODS_PRICE_MEAN PREV_APP_AMT_GOODS_PRICE_SUM PREV_APP_CNT_PAYMENT_MEAN PREV_APP_CNT_PAYMENT_SUM PREV_APP_CNT_PAYMENT_MIN PREV_APP_CNT_PAYMENT_MAX PREV_APP_NAME_CONTRACT_STATUS_Approved_SUM PREV_APP_NAME_CONTRACT_STATUS_Canceled_SUM PREV_APP_NAME_CONTRACT_STATUS_Refused_SUM PREV_APP_NAME_CONTRACT_STATUS_Unused offer_SUM PREV_APP_NAME_CONTRACT_STATUS_REFUSED_RATIO_ INST_PAY_AMT_PAYMENT_MEAN INST_PAY_AMT_PAYMENT_SUM INST_PAY_INSTALMENT_ACTUAL_DAYS_DIFF_MEAN INST_PAY_INSTALMENT_ACTUAL_DAYS_DIFF_SUM INST_PAY_AMT_INSTAL_ACTUAL_DIFF_MEAN INST_PAY_AMT_INSTAL_ACTUAL_DIFF_SUM INST_PAY_LATE_INSTAL_Binary_1_SUM INST_PAY_AMT_INSTAL_ACTUAL_DIFF_Binary_0_SUM INST_PAY_AMT_INSTAL_ACTUAL_DIFF_Binary_1_SUM INST_PAY_LATE_INSTAL_RATIO_ INST_PAY_AMT_PARTIAL_INSTAL_RATIO_ Annuity_Income NEW_CREDIT_TO_GOODS_RATIO GOODS_INCOME LOAN_ACESS INCOME_TO_EMPLOYED_RATIO INCOME_TO_BIRTH_RATIO CNT_NON_CHILD TERM MEAN_BUILDING_SCORE_AVG TOTAL_BUILDING_SCORE_AVG NEW_DOC_STD NEW_DOC_KURT NEW_LIVE_SUM NEW_LIVE_STD NEW_LIVE_KURT AMT_REQ_CREDIT_BUREAU_TOTAL AGE_RANGE NEW_INC_BY_ORG NEW_EXT_SOURCES_MEAN Bureau_No_Record BUREAU_INCOME_CREDIT_RATIO CC_No_Record POS_No_Record NAME_CONTRACT_TYPE_Revolving loans CODE_GENDER_M FLAG_OWN_CAR_Y FLAG_OWN_REALTY_Y NAME_TYPE_SUITE_Family NAME_TYPE_SUITE_Group of people NAME_TYPE_SUITE_Other_A NAME_TYPE_SUITE_Other_B NAME_TYPE_SUITE_Spouse, partner NAME_TYPE_SUITE_Unaccompanied NAME_INCOME_TYPE_Commercial associate NAME_INCOME_TYPE_Pensioner NAME_INCOME_TYPE_State servant NAME_INCOME_TYPE_Student NAME_INCOME_TYPE_Unemployed NAME_INCOME_TYPE_Working NAME_EDUCATION_TYPE_Higher education NAME_EDUCATION_TYPE_Incomplete higher NAME_EDUCATION_TYPE_Lower secondary NAME_EDUCATION_TYPE_Secondary / secondary special NAME_FAMILY_STATUS_Married NAME_FAMILY_STATUS_Separated NAME_FAMILY_STATUS_Single / not married NAME_FAMILY_STATUS_Widow NAME_HOUSING_TYPE_House / apartment NAME_HOUSING_TYPE_Municipal apartment NAME_HOUSING_TYPE_Office apartment NAME_HOUSING_TYPE_Rented apartment NAME_HOUSING_TYPE_With parents OCCUPATION_TYPE_Cleaning staff OCCUPATION_TYPE_Cooking staff OCCUPATION_TYPE_Core staff OCCUPATION_TYPE_Drivers OCCUPATION_TYPE_HR staff OCCUPATION_TYPE_High skill tech staff OCCUPATION_TYPE_IT staff OCCUPATION_TYPE_Laborers OCCUPATION_TYPE_Low-skill Laborers OCCUPATION_TYPE_Managers OCCUPATION_TYPE_Medicine staff OCCUPATION_TYPE_Private service staff OCCUPATION_TYPE_Realty agents OCCUPATION_TYPE_Sales staff OCCUPATION_TYPE_Secretaries OCCUPATION_TYPE_Security staff OCCUPATION_TYPE_Waiters/barmen staff WEEKDAY_APPR_PROCESS_START_MONDAY WEEKDAY_APPR_PROCESS_START_SATURDAY WEEKDAY_APPR_PROCESS_START_SUNDAY WEEKDAY_APPR_PROCESS_START_THURSDAY WEEKDAY_APPR_PROCESS_START_TUESDAY WEEKDAY_APPR_PROCESS_START_WEDNESDAY ORGANIZATION_TYPE_Agriculture ORGANIZATION_TYPE_Bank ORGANIZATION_TYPE_Business Entity Type 1 ORGANIZATION_TYPE_Business Entity Type 2 ORGANIZATION_TYPE_Business Entity Type 3 ORGANIZATION_TYPE_Cleaning ORGANIZATION_TYPE_Construction ORGANIZATION_TYPE_Culture ORGANIZATION_TYPE_Electricity ORGANIZATION_TYPE_Emergency ORGANIZATION_TYPE_Government ORGANIZATION_TYPE_Hotel ORGANIZATION_TYPE_Housing ORGANIZATION_TYPE_Industry: type 1 ORGANIZATION_TYPE_Industry: type 10 ORGANIZATION_TYPE_Industry: type 11 ORGANIZATION_TYPE_Industry: type 12 ORGANIZATION_TYPE_Industry: type 13 ORGANIZATION_TYPE_Industry: type 2 ORGANIZATION_TYPE_Industry: type 3 ORGANIZATION_TYPE_Industry: type 4 ORGANIZATION_TYPE_Industry: type 5 ORGANIZATION_TYPE_Industry: type 6 ORGANIZATION_TYPE_Industry: type 7 ORGANIZATION_TYPE_Industry: type 8 ORGANIZATION_TYPE_Industry: type 9 ORGANIZATION_TYPE_Insurance ORGANIZATION_TYPE_Kindergarten ORGANIZATION_TYPE_Legal Services ORGANIZATION_TYPE_Medicine ORGANIZATION_TYPE_Military ORGANIZATION_TYPE_Mobile ORGANIZATION_TYPE_Other ORGANIZATION_TYPE_Police ORGANIZATION_TYPE_Postal ORGANIZATION_TYPE_Realtor ORGANIZATION_TYPE_Religion ORGANIZATION_TYPE_Restaurant ORGANIZATION_TYPE_School ORGANIZATION_TYPE_Security ORGANIZATION_TYPE_Security Ministries ORGANIZATION_TYPE_Self-employed ORGANIZATION_TYPE_Services ORGANIZATION_TYPE_Telecom ORGANIZATION_TYPE_Trade: type 1 ORGANIZATION_TYPE_Trade: type 2 ORGANIZATION_TYPE_Trade: type 3 ORGANIZATION_TYPE_Trade: type 4 ORGANIZATION_TYPE_Trade: type 5 ORGANIZATION_TYPE_Trade: type 6 ORGANIZATION_TYPE_Trade: type 7 ORGANIZATION_TYPE_Transport: type 1 ORGANIZATION_TYPE_Transport: type 2 ORGANIZATION_TYPE_Transport: type 3 ORGANIZATION_TYPE_Transport: type 4 ORGANIZATION_TYPE_University ORGANIZATION_TYPE_XNA HOUSETYPE_MODE_specific housing HOUSETYPE_MODE_terraced house WALLSMATERIAL_MODE_Mixed WALLSMATERIAL_MODE_Monolithic WALLSMATERIAL_MODE_Others WALLSMATERIAL_MODE_Panel WALLSMATERIAL_MODE_Stone, brick WALLSMATERIAL_MODE_Wooden EMERGENCYSTATE_MODE_Yes
0 0.0 20560.5 450000.0 0.018850 -2329.0 -5170.0 -812.0 0.0 1.0 0.0 1.0 0.0 1.0 2.0 2.0 18.0 0.0 0.0 0.0 0.0 0.0 0.0 0.752614 0.789654 0.159520 0.0590 0.9732 0.00 0.1379 0.1250 0.0487 0.0030 0.0392 0.0 0.0 0.0 -1740.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.428571 0.571429 240043.666667 -93.142857 0.0 0.0 207623.571429 1453365.00 85240.928571 596686.5 0.0 0.000000 0.00 10822.50 3545.357143 0.0 -51.0 0.0 24.571429 172.0 -1.0 0.0 -0.543363 0.282518 817.428571 5722.0 197.000000 788.0 0.0 0.0 1.000000 7.0 0.000000 0.0 0.0 0.0 0.0 0.0 1.0 -0.019908 0.0 0.111111 0.111111 0.0 0.00 0.0 0.000000 0.0 0.0 0.0 0.00 0.0 0.0 0.0 0.00 0.000 0.0 0.00 0.0 0.0 0.000 0.000000 0.0 0.0 0.000 0.00 0.000 0.0 0.000000 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 1.0 1.0 3951.000 2520.0 2520.0 24835.5 24835.5 8.000000 8.0 8.0 8.0 1.0 0.0 0.0 0.0 0.0 5885.132143 41195.925 -7.285714 -51.0 0.000000 0.000 1.0 7.0 0.0 0.142857 0.000000 0.152300 1.2640 3.333333 118800.0 -57.964792 -7.016267 2.0 27.664697 0.235267 1.4116 0.223607 20.0 3.0 0.516398 -1.875000 0.0 4.0 135000.0 0.567263 0.0 1.537952 1.0 0.0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
1 0.0 17370.0 180000.0 0.035792 -4469.0 -9118.0 -1623.0 0.0 1.0 0.0 1.0 0.0 0.0 2.0 2.0 9.0 0.0 0.0 0.0 0.0 0.0 0.0 0.564990 0.291656 0.432962 0.0759 0.9816 0.00 0.1379 0.1667 0.0487 0.0030 0.0688 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 3.0 2.0 0.666667 0.333333 26340.333333 -54.333333 0.0 0.0 219042.000000 657126.00 189469.500000 568408.5 0.0 0.000000 0.00 4261.50 1420.500000 0.0 -12.0 0.0 7.000000 21.0 -1.0 0.0 -0.416667 0.601256 630.000000 1890.0 -5.000000 -5.0 0.0 0.0 0.666667 2.0 0.333333 1.0 0.0 0.0 0.0 0.0 1.0 -0.327273 0.0 0.000000 0.000000 0.0 0.00 0.0 0.000000 0.0 0.0 0.0 0.00 0.0 0.0 0.0 0.00 0.000 0.0 0.00 0.0 0.0 0.000 0.000000 0.0 0.0 0.000 0.00 0.000 0.0 0.000000 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 2.0 2.0 4813.200 4464.0 4464.0 44617.5 44617.5 12.000000 12.0 12.0 12.0 1.0 1.0 0.0 0.0 0.0 6240.205000 56161.845 -23.555556 -212.0 0.000000 0.000 1.0 9.0 0.0 0.111111 0.000000 0.175455 1.2376 1.818182 42768.0 -22.152607 -5.480514 2.0 12.824870 0.207375 0.0000 0.223607 20.0 3.0 0.547723 -3.333333 3.0 3.0 157500.0 0.429869 0.0 2.212545 1.0 0.0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
2 0.0 69777.0 630000.0 0.019101 -4458.0 -2175.0 -3503.0 5.0 1.0 0.0 1.0 0.0 0.0 2.0 2.0 14.0 0.0 0.0 0.0 0.0 0.0 0.0 0.505143 0.699787 0.610991 0.0759 0.9816 0.00 0.1379 0.1667 0.0487 0.0030 0.0688 0.0 0.0 0.0 -856.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 4.0 2.0 0.000000 1.000000 155208.333333 -775.500000 0.0 19305.0 518070.015000 2072280.06 0.000000 0.0 0.0 0.000000 0.00 0.00 0.000000 0.0 -68.0 0.0 57.500000 230.0 -1.0 0.0 -0.736842 0.000000 669.500000 2678.0 -13.250000 -53.0 0.5 2.0 0.500000 2.0 0.000000 0.0 0.0 0.0 0.0 0.0 1.0 -0.517857 0.0 0.055556 0.000000 12645000.0 131718.75 571500.0 6350.000000 0.0 157500.0 0.0 157500.00 0.0 0.0 0.0 0.00 0.000 0.0 0.00 0.0 0.0 654448.545 6817.172344 0.0 153675.0 1737703.665 -274.32 161420.220 23.0 0.255556 0.0 0.0 0.0 0.000000 0.0 1.0 1.0 1.0 96.0 1.0 22.0 1.041667 1.024890 4.0 2.0 11478.195 3375.0 6750.0 174495.0 523485.0 17.333333 52.0 6.0 36.0 3.0 1.0 0.0 0.0 0.0 9740.235774 1509736.545 -5.180645 -803.0 1157.662742 179437.725 11.0 135.0 20.0 0.070968 0.129032 0.344578 1.0528 3.111111 33264.0 -45.423957 -10.105799 2.0 9.505482 0.207375 0.0000 0.223607 20.0 3.0 0.547723 -3.333333 5.0 4.0 180000.0 0.655389 0.0 2.558370 0.0 0.0 0 1 1 1 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0
3 2.0 49018.5 1575000.0 0.026392 -1866.0 -2000.0 -4208.0 0.0 1.0 0.0 1.0 1.0 0.0 4.0 2.0 11.0 0.0 0.0 0.0 0.0 0.0 0.0 0.525734 0.509677 0.612704 0.1974 0.9970 0.32 0.2759 0.3750 0.2078 0.0817 0.3700 0.0 0.0 0.0 -1805.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 3.0 2.0 0.416667 0.583333 430355.840909 -651.500000 0.0 0.0 126739.590000 1520875.08 18630.450000 186304.5 0.0 14484.394286 101390.76 12897.09 3012.010714 0.0 -69.0 0.0 46.666667 560.0 -1.0 0.0 -0.538289 0.122267 3932.700000 39327.0 61.166667 367.0 0.0 0.0 0.583333 7.0 0.416667 5.0 0.0 0.0 0.0 0.0 1.0 -0.241353 0.0 0.000000 0.000000 11025000.0 225000.00 27000.0 613.636364 0.0 18000.0 0.0 22823.55 0.0 0.0 0.0 274663.62 6242.355 0.0 22823.55 0.0 0.0 274701.465 5606.152347 0.0 15750.0 390461.850 0.00 36980.415 2.0 0.045455 0.0 0.0 115.0 2.613636 12.0 1.0 0.0 0.0 49.0 1.0 35.0 2.040816 0.165937 5.0 3.0 8091.585 3750.0 11250.0 82012.5 246037.5 11.333333 34.0 0.0 24.0 3.0 1.0 0.0 1.0 0.0 4356.731549 492310.665 -3.000000 -339.0 622.550708 70348.230 12.0 93.0 20.0 0.106195 0.176991 0.155614 1.0000 5.000000 0.0 -168.810289 -22.538638 2.0 32.130726 0.322743 4.5184 0.223607 20.0 3.0 0.516398 -1.875000 3.0 2.0 180000.0 0.549372 0.0 0.402348 0.0 0.0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
4 1.0 32067.0 625500.0 0.010032 -2191.0 -4000.0 -4262.0 16.0 1.0 1.0 1.0 0.0 0.0 3.0 2.0 5.0 0.0 0.0 0.0 0.0 1.0 1.0 0.202145 0.425687 0.537070 0.0759 0.9816 0.00 0.1379 0.1667 0.0487 0.0030 0.0688 0.0 0.0 0.0 -821.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.000000 0.00 0.000000 0.0 0.0 0.000000 0.00 0.00 0.000000 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.000000 0.000000 0.000000 0.0 0.000000 0.0 0.0 0.0 0.000000 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 1.0 -0.278388 0.0 0.000000 0.000000 0.0 0.00 0.0 0.000000 0.0 0.0 0.0 0.00 0.0 0.0 0.0 0.00 0.000 0.0 0.00 0.0 0.0 0.000 0.000000 0.0 0.0 0.000 0.00 0.000 0.0 0.000000 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 2.0 2.0 17782.155 8095.5 8095.5 267727.5 535455.0 24.000000 48.0 12.0 36.0 1.0 1.0 0.0 0.0 0.0 11100.337500 133204.050 -12.250000 -147.0 0.000000 0.000 0.0 12.0 0.0 0.000000 0.000000 0.178150 1.0000 3.475000 0.0 -82.154267 -13.803681 2.0 19.506034 0.207375 0.0000 0.223607 20.0 3.0 0.516398 -1.875000 0.0 2.0 180000.0 0.313916 1.0 0.000000 1.0 0.0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0

After fit and tranform we extract these to files into a CSV file and save on the local drive

In [24]:
result_train.to_csv(path + 'Final_Files/train_final.csv', index = False)
In [25]:
result_test.to_csv(path + 'Final_Files/test_final.csv', index = False)

Here we devlop kaggle train dataset that would help us in Hypeparameter tunning and assist us in devloping a robust model for our Kaggle

In [26]:
full_train = data_pipeline.fit_transform(app_train)
full_train.shape
1it [00:01,  1.97s/it]
0it [00:00, ?it/s]
bureau_processed.csv ... Read Completed
****************************************
1it [00:00,  3.05it/s]
0it [00:00, ?it/s]
pos_processed.csv ... Read Completed
****************************************
1it [00:01,  1.60s/it]
0it [00:00, ?it/s]
cc_processed.csv ... Read Completed
****************************************
1it [00:00,  1.00it/s]
0it [00:00, ?it/s]
previous_application_processed.csv ... Read Completed
****************************************
1it [00:00,  1.27it/s]
installments_payments_processed.csv ... Read Completed
****************************************
supplemetery files merged to main train/test file............
Feature Adder commpleted............
Abnormality removed............
Dimensionality reduced............
Columns split completed............
Imputing completed............
One Hot Encoding Performed............
Out[26]:
(307511, 312)
In [ ]:
#Export data into local drive for further model building
full_train.to_csv(path + 'Final_Files/Full_train.csv', index = False)

Conclusion of PART 2

This part we have developed end to end pre-processing pipelines and transformed datasets for further model building.

Final Preprocessed files have 311 features.For the original 121 features from the original app train dataset, we have added 190 features for further analysis.

Part 3, includes further feature engineering and developing model using this transformed dataset



PART 3 - MODEL EVALUATION

Introduction and Summary of Result

PART 3 of Phase2 project includes several model evaluation. During Phase1 we have tested Logistic Regression, Random Forest and Gradient Boosting Classifiers. Based on phase1 result, it was clear that gradient boosting decision tree models where performing better than logistic regression or random forest model.

Following case studies were evaluated for this phase. There are more than 25000 experiments performed using gridsearch and various pre-processing and balancing ratios.

Model XGBOOST LGBM XGBOOST KAGGLE SCORE LGBM KAGGLE SCORE
Unbalanced Dataset 360 Experiments 3240 Experiments 0.792 0.787
Manual Balanced Dataset (0.25 ratio) 360 Experiments 3240 Experiments 0.779 0.790
Manual Balanced Dataset (0.33 ratio) 360 Experiments 3240 Experiments 0.790 0.791
Manual Balanced Dataset (0.5 ratio) 360 Experiments 3240 Experiments 0.783 0.789
Numerically Scaled - Unbalanced Dataset 360 Experiments 3240 Experiments 0.789 0.787
PCA - Unbalanced Dataset - 180 Experiments - 0.600
SMOTE - Synthetic Balance 360 Experiments 9720 Experiments 0.773 0.794

Out of All Case studies, Best model based on Kaggle score is Synthetically balanced dataset with LightGBM with kaggle score 0.794. Second closet results is unbalanced dataset. Using LightGBM, kaggle score for second best model is 0.792

For list of complete experiment logs refer Discussion of Results

Reading Files

In Part 1 and Part 2 we have performed lot of feature engineering and pre-processing steps. We can take advantage of that in this part. We will read final merged and pre-processed files and use it to further train and test model.

In [1]:
import json

from sklearn.model_selection import GridSearchCV, train_test_split, ParameterGrid

from sklearn.linear_model import LogisticRegression
# from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from lightgbm import LGBMClassifier
# from catboost import CatBoostClassifier
from xgboost import XGBClassifier


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.metrics import accuracy_score, f1_score, roc_curve, auc, \
roc_auc_score, confusion_matrix, plot_roc_curve, plot_confusion_matrix

from sklearn.base import BaseEstimator, TransformerMixin

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.experimental import enable_iterative_imputer  
from sklearn.impute import SimpleImputer, IterativeImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)

from tqdm import tqdm
from time import time
from datetime import datetime

pd.set_option('display.max_columns', 500)
In [6]:
path = "data/final_processed/"
def read_files(list_of_files, path, print_details = True):
    df_list = []
    for file in list_of_files:
        chunksize = 500000
        filename = path + file
        
        i = 1
        for chunk in tqdm(pd.read_csv(filename, chunksize=chunksize, low_memory=False)):
            df = chunk if i == 1 else pd.concat([df, chunk])
            if print_details:
                print('-->Read Chunk...', i)
            i += 1
        df_list.append(df)
        print(file + " ... Read Completed")
        print('*'*40)
    return df_list
In [7]:
## Reading Application train and test file. 

file_list = ['train_final.csv', 'test_final.csv']


#app_test, app_train, installment, bureau_balance, pos_cash, bureau, prev_app, cc_bal = read_files(file_list, path, print_details=False)
df_train, df_test = read_files(file_list, path, print_details=False)
1it [00:11, 11.43s/it]
0it [00:00, ?it/s]
train_final.csv ... Read Completed
****************************************
1it [00:03,  3.35s/it]
test_final.csv ... Read Completed
****************************************

Exploring Data

Currently pre-processed files have 310 features. One column represent TARGET variable.

In [8]:
print('Train Data Shape is :', df_train.shape)
print('Test Data Shape is :', df_test.shape)
Train Data Shape is : (153755, 311)
Test Data Shape is : (48744, 310)
In [5]:
df_train.head()
Out[5]:
CNT_CHILDREN AMT_ANNUITY AMT_GOODS_PRICE REGION_POPULATION_RELATIVE DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH OWN_CAR_AGE FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL CNT_FAM_MEMBERS REGION_RATING_CLIENT_W_CITY HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI LANDAREA_MEDI NONLIVINGAREA_MEDI TOTALAREA_MODE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_3 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_11 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR TARGET BUREAU_LOAN_TYPES BUREAU_ACTIVE_LOANS_PERCENTAGE BUREAU_CREDIT_ENDDATE_PERCENTAGE BUREAU_VARIANCE_DAYS_FROM_LAST_APPLICATION BUREAU_DAYS_CREDIT_UPDATE_MEAN BUREAU_AVG_OVERDUE_DAYS BUREAU_AVG_CREDIT_OVERDUE BUREAU_AVG_CREDIT_AMT BUREAU_TOTAL_CREDIT_AMT BUREAU_AVG_CREDIT_DEBT BUREAU_TOTAL_CREDIT_DEBT BUREAU_AVG_CURRENT_AMT_DUE BUREAU_AVG_CREDIT_CARD_LIMIT BUREAU_TOTAL_CREDIT_CARD_LIMIT BUREAU_MAX_ANNUITY BUREAU_AVG_ANNUITY BUREAU_TOTAL_PROLONGED_CREDIT BUREAU_MONTHS_BALANCE_MIN BUREAU_MONTHS_BALANCE_MAX BUREAU_MONTHS_BALANCE_SIZE_MEAN BUREAU_MONTHS_BALANCE_SIZE_SUM BUREAU_STATUS_MIN BUREAU_STATUS_MAX BUREAU_STATUS_MEAN BUREAU_DEBT_PERCENTAGE_MEAN BUREAU_CREDIT_DURATION_MEAN BUREAU_CREDIT_DURATION_SUM BUREAU_ENDDATE_DIF_MEAN BUREAU_ENDDATE_DIF_SUM BUREAU_AVG_COUNT_CAR_LOAN BUREAU_TOT_COUNT_CAR_LOAN BUREAU_AVG_COUNT_CONSUMER_LOAN BUREAU_TOT_COUNT_CONSUMER_LOAN BUREAU_AVG_COUNT_CREDIT_CARD_LOAN BUREAU_TOT_COUNT_CREDIT_CARD_LOAN BUREAU_AVG_COUNT_MORTGAGE BUREAU_TOT_COUNT_MORTGAGE BUREAU_AVG_COUNT_RARE_LOAN BUREAU_TOT_COUNT_RARE_LOAN POS_ACT_CONTRACTS POS_Fut_Install_Mon_Ratio POS_REM_INSTALMENT POS_PAID_LATE POS_PAID_LATE_TOL CC_AMT_CREDIT_LIMIT_ACTUAL_SUM CC_AMT_CREDIT_LIMIT_ACTUAL_MEAN CC_AMT_DRAWINGS_ATM_CURRENT_SUM CC_AMT_DRAWINGS_ATM_CURRENT_MEAN CC_AMT_DRAWINGS_ATM_CURRENT_MIN CC_AMT_DRAWINGS_ATM_CURRENT_MAX CC_AMT_DRAWINGS_CURRENT_MIN CC_AMT_DRAWINGS_CURRENT_MAX CC_AMT_DRAWINGS_OTHER_CURRENT_MEAN CC_AMT_DRAWINGS_OTHER_CURRENT_MIN CC_AMT_DRAWINGS_OTHER_CURRENT_MAX CC_AMT_DRAWINGS_POS_CURRENT_SUM CC_AMT_DRAWINGS_POS_CURRENT_MEAN CC_AMT_DRAWINGS_POS_CURRENT_MIN CC_AMT_DRAWINGS_POS_CURRENT_MAX CC_AMT_INST_MIN_REGULARITY_MIN CC_AMT_PAYMENT_CURRENT_MIN CC_AMT_PAYMENT_TOTAL_CURRENT_SUM CC_AMT_PAYMENT_TOTAL_CURRENT_MEAN CC_AMT_PAYMENT_TOTAL_CURRENT_MIN CC_AMT_PAYMENT_TOTAL_CURRENT_MAX CC_AMT_TOTAL_RECEIVABLE_SUM CC_AMT_TOTAL_RECEIVABLE_MIN CC_AMT_TOTAL_RECEIVABLE_MAX CC_CNT_DRAWINGS_ATM_CURRENT_SUM CC_CNT_DRAWINGS_ATM_CURRENT_MEAN CC_CNT_DRAWINGS_OTHER_CURRENT_MEAN CC_CNT_DRAWINGS_OTHER_CURRENT_MAX CC_CNT_DRAWINGS_POS_CURRENT_SUM CC_CNT_DRAWINGS_POS_CURRENT_MEAN CC_CNT_DRAWINGS_POS_CURRENT_MAX CC_CNT_INSTALMENT_MATURE_CUM_MIN CC_SK_DPD_MAX CC_SK_DPD_DEF_MAX CC_NAME_CONTRACT_STATUS_SUM CC_NAME_CONTRACT_STATUS_MIN CC_TOTAL_INSTALMENTS CC_PERCENTAGE_MISSED_PAYMENTS CC_CREDIT_LOAD_y PREV_APP_SK_ID_PREV_COUNT PREV_APP_NAME_CONTRACT_TYPE_NUNIQUE PREV_APP_AMT_ANNUITY_MEAN PREV_APP_AMT_DOWN_PAYMENT_MEAN PREV_APP_AMT_DOWN_PAYMENT_SUM PREV_APP_AMT_GOODS_PRICE_MEAN PREV_APP_AMT_GOODS_PRICE_SUM PREV_APP_CNT_PAYMENT_MEAN PREV_APP_CNT_PAYMENT_SUM PREV_APP_CNT_PAYMENT_MIN PREV_APP_CNT_PAYMENT_MAX PREV_APP_NAME_CONTRACT_STATUS_Approved_SUM PREV_APP_NAME_CONTRACT_STATUS_Canceled_SUM PREV_APP_NAME_CONTRACT_STATUS_Refused_SUM PREV_APP_NAME_CONTRACT_STATUS_Unused offer_SUM PREV_APP_NAME_CONTRACT_STATUS_REFUSED_RATIO_ INST_PAY_AMT_PAYMENT_MEAN INST_PAY_AMT_PAYMENT_SUM INST_PAY_INSTALMENT_ACTUAL_DAYS_DIFF_MEAN INST_PAY_INSTALMENT_ACTUAL_DAYS_DIFF_SUM INST_PAY_AMT_INSTAL_ACTUAL_DIFF_MEAN INST_PAY_AMT_INSTAL_ACTUAL_DIFF_SUM INST_PAY_LATE_INSTAL_Binary_1_SUM INST_PAY_AMT_INSTAL_ACTUAL_DIFF_Binary_0_SUM INST_PAY_AMT_INSTAL_ACTUAL_DIFF_Binary_1_SUM INST_PAY_LATE_INSTAL_RATIO_ INST_PAY_AMT_PARTIAL_INSTAL_RATIO_ Annuity_Income NEW_CREDIT_TO_GOODS_RATIO GOODS_INCOME LOAN_ACESS INCOME_TO_EMPLOYED_RATIO INCOME_TO_BIRTH_RATIO CNT_NON_CHILD TERM MEAN_BUILDING_SCORE_AVG TOTAL_BUILDING_SCORE_AVG NEW_DOC_STD NEW_DOC_KURT NEW_LIVE_SUM NEW_LIVE_STD NEW_LIVE_KURT AMT_REQ_CREDIT_BUREAU_TOTAL AGE_RANGE NEW_INC_BY_ORG NEW_EXT_SOURCES_MEAN Bureau_No_Record BUREAU_INCOME_CREDIT_RATIO CC_No_Record POS_No_Record NAME_CONTRACT_TYPE_Revolving loans CODE_GENDER_M FLAG_OWN_CAR_Y FLAG_OWN_REALTY_Y NAME_TYPE_SUITE_Family NAME_TYPE_SUITE_Group of people NAME_TYPE_SUITE_Other_A NAME_TYPE_SUITE_Other_B NAME_TYPE_SUITE_Spouse, partner NAME_TYPE_SUITE_Unaccompanied NAME_INCOME_TYPE_Commercial associate NAME_INCOME_TYPE_Pensioner NAME_INCOME_TYPE_State servant NAME_INCOME_TYPE_Student NAME_INCOME_TYPE_Unemployed NAME_INCOME_TYPE_Working NAME_EDUCATION_TYPE_Higher education NAME_EDUCATION_TYPE_Incomplete higher NAME_EDUCATION_TYPE_Lower secondary NAME_EDUCATION_TYPE_Secondary / secondary special NAME_FAMILY_STATUS_Married NAME_FAMILY_STATUS_Separated NAME_FAMILY_STATUS_Single / not married NAME_FAMILY_STATUS_Widow NAME_HOUSING_TYPE_House / apartment NAME_HOUSING_TYPE_Municipal apartment NAME_HOUSING_TYPE_Office apartment NAME_HOUSING_TYPE_Rented apartment NAME_HOUSING_TYPE_With parents OCCUPATION_TYPE_Cleaning staff OCCUPATION_TYPE_Cooking staff OCCUPATION_TYPE_Core staff OCCUPATION_TYPE_Drivers OCCUPATION_TYPE_HR staff OCCUPATION_TYPE_High skill tech staff OCCUPATION_TYPE_IT staff OCCUPATION_TYPE_Laborers OCCUPATION_TYPE_Low-skill Laborers OCCUPATION_TYPE_Managers OCCUPATION_TYPE_Medicine staff OCCUPATION_TYPE_Private service staff OCCUPATION_TYPE_Realty agents OCCUPATION_TYPE_Sales staff OCCUPATION_TYPE_Secretaries OCCUPATION_TYPE_Security staff OCCUPATION_TYPE_Waiters/barmen staff WEEKDAY_APPR_PROCESS_START_MONDAY WEEKDAY_APPR_PROCESS_START_SATURDAY WEEKDAY_APPR_PROCESS_START_SUNDAY WEEKDAY_APPR_PROCESS_START_THURSDAY WEEKDAY_APPR_PROCESS_START_TUESDAY WEEKDAY_APPR_PROCESS_START_WEDNESDAY ORGANIZATION_TYPE_Agriculture ORGANIZATION_TYPE_Bank ORGANIZATION_TYPE_Business Entity Type 1 ORGANIZATION_TYPE_Business Entity Type 2 ORGANIZATION_TYPE_Business Entity Type 3 ORGANIZATION_TYPE_Cleaning ORGANIZATION_TYPE_Construction ORGANIZATION_TYPE_Culture ORGANIZATION_TYPE_Electricity ORGANIZATION_TYPE_Emergency ORGANIZATION_TYPE_Government ORGANIZATION_TYPE_Hotel ORGANIZATION_TYPE_Housing ORGANIZATION_TYPE_Industry: type 1 ORGANIZATION_TYPE_Industry: type 10 ORGANIZATION_TYPE_Industry: type 11 ORGANIZATION_TYPE_Industry: type 12 ORGANIZATION_TYPE_Industry: type 13 ORGANIZATION_TYPE_Industry: type 2 ORGANIZATION_TYPE_Industry: type 3 ORGANIZATION_TYPE_Industry: type 4 ORGANIZATION_TYPE_Industry: type 5 ORGANIZATION_TYPE_Industry: type 6 ORGANIZATION_TYPE_Industry: type 7 ORGANIZATION_TYPE_Industry: type 8 ORGANIZATION_TYPE_Industry: type 9 ORGANIZATION_TYPE_Insurance ORGANIZATION_TYPE_Kindergarten ORGANIZATION_TYPE_Legal Services ORGANIZATION_TYPE_Medicine ORGANIZATION_TYPE_Military ORGANIZATION_TYPE_Mobile ORGANIZATION_TYPE_Other ORGANIZATION_TYPE_Police ORGANIZATION_TYPE_Postal ORGANIZATION_TYPE_Realtor ORGANIZATION_TYPE_Religion ORGANIZATION_TYPE_Restaurant ORGANIZATION_TYPE_School ORGANIZATION_TYPE_Security ORGANIZATION_TYPE_Security Ministries ORGANIZATION_TYPE_Self-employed ORGANIZATION_TYPE_Services ORGANIZATION_TYPE_Telecom ORGANIZATION_TYPE_Trade: type 1 ORGANIZATION_TYPE_Trade: type 2 ORGANIZATION_TYPE_Trade: type 3 ORGANIZATION_TYPE_Trade: type 4 ORGANIZATION_TYPE_Trade: type 5 ORGANIZATION_TYPE_Trade: type 6 ORGANIZATION_TYPE_Trade: type 7 ORGANIZATION_TYPE_Transport: type 1 ORGANIZATION_TYPE_Transport: type 2 ORGANIZATION_TYPE_Transport: type 3 ORGANIZATION_TYPE_Transport: type 4 ORGANIZATION_TYPE_University ORGANIZATION_TYPE_XNA HOUSETYPE_MODE_specific housing HOUSETYPE_MODE_terraced house WALLSMATERIAL_MODE_Mixed WALLSMATERIAL_MODE_Monolithic WALLSMATERIAL_MODE_Others WALLSMATERIAL_MODE_Panel WALLSMATERIAL_MODE_Stone, brick WALLSMATERIAL_MODE_Wooden EMERGENCYSTATE_MODE_Yes
0 0.0 27652.5 450000.0 0.011703 -1063.0 -185.0 -3079.0 10.0 1.0 0.0 1.0 0.0 0.0 2.0 2.0 12.0 0.0 0.0 0.0 0.0 0.0 0.0 0.221539 0.617474 0.315472 0.0000 0.9717 0.00 0.0690 0.0000 0.0000 0.000 0.0017 0.0 0.0 0.0 -628.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0 2.0 0.714286 0.142857 26486.142857 -45.857143 0.0 0.0 2.670858e+05 1869600.645 5050.285714 35352.0 0.0 78750.0 393750.0 39375.0 7207.3125 0.0 -17.0 0.0 9.142857 64.0 -1.0 0.0 -0.166667 0.100412 6956.000000 34780.0 961.000000 1922.0 0.0 0.0 0.428571 3.0 0.571429 4.0 0.0 0.0 0.0 0.0 1.0 -0.505618 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.0 1.0 5882.08500 8943.75 17887.5 59107.5 118215.0 12.0 24.0 12.0 12.0 2.0 0.0 0.0 0.0 0.000000 27611.190000 110444.760 -21.500000 -86.0 0.000000 0.00 0.0 4.0 0.0 0.000000 0.000000 0.204833 1.198000 3.333333 89100.0 -126.999059 -12.998267 2.0 19.495525 0.121479 1.7007 0.223607 20.0 3.0 0.547723 -3.333333 1.0 2.0 157500.0 0.384829 0.0 1.978413 1.0 0.0 0 1 1 1 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0
1 0.0 29511.0 328500.0 0.003122 -1458.0 -242.0 -2413.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0 3.0 10.0 0.0 0.0 0.0 0.0 0.0 0.0 0.505143 0.618197 0.537070 0.0503 0.9816 0.00 0.1379 0.1667 0.0490 0.000 0.0467 0.0 1.0 0.0 -739.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1 0.0 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.000000e+00 0.000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0000 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.000000 0.000000 0.000000 0.0 0.000000 0.0 0.0 0.0 0.000000 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 1.0 -1.736842 0.0 0.052632 0.052632 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 24769.53000 2947.50 0.0 315000.0 315000.0 36.0 36.0 36.0 36.0 1.0 0.0 0.0 0.0 0.000000 28760.293800 719007.345 -3.840000 -96.0 3963.124800 99078.12 10.0 17.0 8.0 0.400000 0.320000 0.262320 1.132000 2.920000 43362.0 -77.160494 -5.964373 1.0 12.600793 0.179686 2.5156 0.223607 20.0 3.0 0.547723 -3.333333 0.0 4.0 135000.0 0.618197 1.0 0.000000 1.0 0.0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
2 0.0 22986.0 283500.0 0.010500 -1653.0 -9788.0 -4157.0 19.0 0.0 0.0 1.0 0.0 0.0 2.0 3.0 9.0 0.0 0.0 0.0 0.0 0.0 0.0 0.505143 0.219295 0.673830 0.0759 0.9816 0.00 0.1379 0.1667 0.0487 0.003 0.0688 0.0 2.0 0.0 -1000.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 2.0 0 2.0 0.400000 0.600000 12824.000000 -435.400000 0.0 0.0 3.155391e+04 157769.550 0.000000 0.0 0.0 0.0 0.0 0.0 0.0000 0.0 -35.0 0.0 15.400000 0.0 -1.0 0.0 -0.277990 0.000000 779.200000 3896.0 18.666667 56.0 0.0 0.0 0.600000 3.0 0.400000 2.0 0.0 0.0 0.0 0.0 1.0 -0.182755 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 5.0 2.0 13948.76700 8634.00 25902.0 96999.3 484996.5 8.0 40.0 6.0 12.0 5.0 0.0 0.0 0.0 0.000000 14040.654868 533544.885 -21.210526 -806.0 0.000000 0.00 0.0 38.0 0.0 0.000000 0.000000 0.255400 1.158397 3.150000 44905.5 -96.926805 -3.967204 2.0 14.287197 0.207375 0.0000 0.223607 20.0 3.0 0.516398 -1.875000 3.0 4.0 117000.0 0.446563 0.0 0.350599 1.0 0.0 0 0 1 1 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0
3 2.0 34173.0 450000.0 0.018209 -5051.0 -4311.0 -3675.0 15.0 1.0 1.0 1.0 0.0 0.0 4.0 3.0 21.0 0.0 0.0 0.0 0.0 0.0 0.0 0.394391 0.400263 0.726711 0.1192 0.9811 0.24 0.2069 0.3333 0.1176 0.000 0.1924 1.0 5.0 0.0 -1515.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0 2.0 0.000000 1.000000 409352.250000 -792.250000 0.0 0.0 2.758369e+05 1103347.575 0.000000 0.0 0.0 0.0 0.0 38812.5 12937.5000 0.0 -69.0 0.0 42.750000 171.0 -1.0 0.0 -0.811189 0.000000 436.250000 1745.0 7.500000 30.0 0.0 0.0 0.750000 3.0 0.250000 1.0 0.0 0.0 0.0 0.0 1.0 -0.066599 0.0 0.117647 0.117647 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.0 1.0 5757.81750 7647.75 15295.5 42457.5 84915.0 7.0 14.0 4.0 10.0 2.0 0.0 0.0 0.0 0.000000 4328.507250 86570.145 -8.000000 -160.0 1925.239500 38504.79 7.0 8.0 12.0 0.350000 0.600000 0.379700 1.132000 5.000000 59400.0 -17.818254 -7.506255 2.0 14.906505 0.269136 3.7679 0.223607 20.0 3.0 0.516398 -1.875000 2.0 2.0 157500.0 0.507122 0.0 3.064854 1.0 0.0 0 0 1 1 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
4 0.0 48208.5 1305000.0 0.001276 -1817.0 -8499.0 -4347.0 0.0 1.0 1.0 1.0 1.0 0.0 2.0 2.0 15.0 0.0 0.0 0.0 0.0 0.0 0.0 0.505143 0.582560 0.729567 0.0934 0.9876 0.00 0.2069 0.1667 0.0594 0.000 0.0265 0.0 0.0 0.0 -1250.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.0 0 1.0 0.000000 0.750000 850249.583333 -995.250000 0.0 0.0 1.238625e+06 4954500.000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0000 0.0 -35.0 0.0 15.400000 0.0 -1.0 0.0 -0.277990 0.000000 1340.666667 4022.0 234.666667 704.0 0.0 0.0 1.000000 4.0 0.000000 0.0 0.0 0.0 0.0 0.0 1.0 -0.475908 0.0 0.015873 0.015873 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 4.0 2.0 23803.70625 0.00 0.0 308223.0 1232892.0 24.0 96.0 18.0 36.0 3.0 0.0 1.0 0.0 0.333333 16761.253269 1307377.755 -3.448718 -269.0 2685.403846 209461.50 26.0 48.0 30.0 0.333333 0.384615 0.133912 1.145200 3.625000 189486.0 -198.128784 -16.662038 2.0 31.000467 0.193679 2.7115 0.223607 20.0 3.0 0.408248 6.000000 2.0 4.0 157500.0 0.656063 0.0 3.440625 1.0 0.0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
In [6]:
df_test.head()
Out[6]:
CNT_CHILDREN AMT_ANNUITY AMT_GOODS_PRICE REGION_POPULATION_RELATIVE DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH OWN_CAR_AGE FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL CNT_FAM_MEMBERS REGION_RATING_CLIENT_W_CITY HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI LANDAREA_MEDI NONLIVINGAREA_MEDI TOTALAREA_MODE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_3 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_11 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR BUREAU_LOAN_TYPES BUREAU_ACTIVE_LOANS_PERCENTAGE BUREAU_CREDIT_ENDDATE_PERCENTAGE BUREAU_VARIANCE_DAYS_FROM_LAST_APPLICATION BUREAU_DAYS_CREDIT_UPDATE_MEAN BUREAU_AVG_OVERDUE_DAYS BUREAU_AVG_CREDIT_OVERDUE BUREAU_AVG_CREDIT_AMT BUREAU_TOTAL_CREDIT_AMT BUREAU_AVG_CREDIT_DEBT BUREAU_TOTAL_CREDIT_DEBT BUREAU_AVG_CURRENT_AMT_DUE BUREAU_AVG_CREDIT_CARD_LIMIT BUREAU_TOTAL_CREDIT_CARD_LIMIT BUREAU_MAX_ANNUITY BUREAU_AVG_ANNUITY BUREAU_TOTAL_PROLONGED_CREDIT BUREAU_MONTHS_BALANCE_MIN BUREAU_MONTHS_BALANCE_MAX BUREAU_MONTHS_BALANCE_SIZE_MEAN BUREAU_MONTHS_BALANCE_SIZE_SUM BUREAU_STATUS_MIN BUREAU_STATUS_MAX BUREAU_STATUS_MEAN BUREAU_DEBT_PERCENTAGE_MEAN BUREAU_CREDIT_DURATION_MEAN BUREAU_CREDIT_DURATION_SUM BUREAU_ENDDATE_DIF_MEAN BUREAU_ENDDATE_DIF_SUM BUREAU_AVG_COUNT_CAR_LOAN BUREAU_TOT_COUNT_CAR_LOAN BUREAU_AVG_COUNT_CONSUMER_LOAN BUREAU_TOT_COUNT_CONSUMER_LOAN BUREAU_AVG_COUNT_CREDIT_CARD_LOAN BUREAU_TOT_COUNT_CREDIT_CARD_LOAN BUREAU_AVG_COUNT_MORTGAGE BUREAU_TOT_COUNT_MORTGAGE BUREAU_AVG_COUNT_RARE_LOAN BUREAU_TOT_COUNT_RARE_LOAN POS_ACT_CONTRACTS POS_Fut_Install_Mon_Ratio POS_REM_INSTALMENT POS_PAID_LATE POS_PAID_LATE_TOL CC_AMT_CREDIT_LIMIT_ACTUAL_SUM CC_AMT_CREDIT_LIMIT_ACTUAL_MEAN CC_AMT_DRAWINGS_ATM_CURRENT_SUM CC_AMT_DRAWINGS_ATM_CURRENT_MEAN CC_AMT_DRAWINGS_ATM_CURRENT_MIN CC_AMT_DRAWINGS_ATM_CURRENT_MAX CC_AMT_DRAWINGS_CURRENT_MIN CC_AMT_DRAWINGS_CURRENT_MAX CC_AMT_DRAWINGS_OTHER_CURRENT_MEAN CC_AMT_DRAWINGS_OTHER_CURRENT_MIN CC_AMT_DRAWINGS_OTHER_CURRENT_MAX CC_AMT_DRAWINGS_POS_CURRENT_SUM CC_AMT_DRAWINGS_POS_CURRENT_MEAN CC_AMT_DRAWINGS_POS_CURRENT_MIN CC_AMT_DRAWINGS_POS_CURRENT_MAX CC_AMT_INST_MIN_REGULARITY_MIN CC_AMT_PAYMENT_CURRENT_MIN CC_AMT_PAYMENT_TOTAL_CURRENT_SUM CC_AMT_PAYMENT_TOTAL_CURRENT_MEAN CC_AMT_PAYMENT_TOTAL_CURRENT_MIN CC_AMT_PAYMENT_TOTAL_CURRENT_MAX CC_AMT_TOTAL_RECEIVABLE_SUM CC_AMT_TOTAL_RECEIVABLE_MIN CC_AMT_TOTAL_RECEIVABLE_MAX CC_CNT_DRAWINGS_ATM_CURRENT_SUM CC_CNT_DRAWINGS_ATM_CURRENT_MEAN CC_CNT_DRAWINGS_OTHER_CURRENT_MEAN CC_CNT_DRAWINGS_OTHER_CURRENT_MAX CC_CNT_DRAWINGS_POS_CURRENT_SUM CC_CNT_DRAWINGS_POS_CURRENT_MEAN CC_CNT_DRAWINGS_POS_CURRENT_MAX CC_CNT_INSTALMENT_MATURE_CUM_MIN CC_SK_DPD_MAX CC_SK_DPD_DEF_MAX CC_NAME_CONTRACT_STATUS_SUM CC_NAME_CONTRACT_STATUS_MIN CC_TOTAL_INSTALMENTS CC_PERCENTAGE_MISSED_PAYMENTS CC_CREDIT_LOAD_y PREV_APP_SK_ID_PREV_COUNT PREV_APP_NAME_CONTRACT_TYPE_NUNIQUE PREV_APP_AMT_ANNUITY_MEAN PREV_APP_AMT_DOWN_PAYMENT_MEAN PREV_APP_AMT_DOWN_PAYMENT_SUM PREV_APP_AMT_GOODS_PRICE_MEAN PREV_APP_AMT_GOODS_PRICE_SUM PREV_APP_CNT_PAYMENT_MEAN PREV_APP_CNT_PAYMENT_SUM PREV_APP_CNT_PAYMENT_MIN PREV_APP_CNT_PAYMENT_MAX PREV_APP_NAME_CONTRACT_STATUS_Approved_SUM PREV_APP_NAME_CONTRACT_STATUS_Canceled_SUM PREV_APP_NAME_CONTRACT_STATUS_Refused_SUM PREV_APP_NAME_CONTRACT_STATUS_Unused offer_SUM PREV_APP_NAME_CONTRACT_STATUS_REFUSED_RATIO_ INST_PAY_AMT_PAYMENT_MEAN INST_PAY_AMT_PAYMENT_SUM INST_PAY_INSTALMENT_ACTUAL_DAYS_DIFF_MEAN INST_PAY_INSTALMENT_ACTUAL_DAYS_DIFF_SUM INST_PAY_AMT_INSTAL_ACTUAL_DIFF_MEAN INST_PAY_AMT_INSTAL_ACTUAL_DIFF_SUM INST_PAY_LATE_INSTAL_Binary_1_SUM INST_PAY_AMT_INSTAL_ACTUAL_DIFF_Binary_0_SUM INST_PAY_AMT_INSTAL_ACTUAL_DIFF_Binary_1_SUM INST_PAY_LATE_INSTAL_RATIO_ INST_PAY_AMT_PARTIAL_INSTAL_RATIO_ Annuity_Income NEW_CREDIT_TO_GOODS_RATIO GOODS_INCOME LOAN_ACESS INCOME_TO_EMPLOYED_RATIO INCOME_TO_BIRTH_RATIO CNT_NON_CHILD TERM MEAN_BUILDING_SCORE_AVG TOTAL_BUILDING_SCORE_AVG NEW_DOC_STD NEW_DOC_KURT NEW_LIVE_SUM NEW_LIVE_STD NEW_LIVE_KURT AMT_REQ_CREDIT_BUREAU_TOTAL AGE_RANGE NEW_INC_BY_ORG NEW_EXT_SOURCES_MEAN Bureau_No_Record BUREAU_INCOME_CREDIT_RATIO CC_No_Record POS_No_Record NAME_CONTRACT_TYPE_Revolving loans CODE_GENDER_M FLAG_OWN_CAR_Y FLAG_OWN_REALTY_Y NAME_TYPE_SUITE_Family NAME_TYPE_SUITE_Group of people NAME_TYPE_SUITE_Other_A NAME_TYPE_SUITE_Other_B NAME_TYPE_SUITE_Spouse, partner NAME_TYPE_SUITE_Unaccompanied NAME_INCOME_TYPE_Commercial associate NAME_INCOME_TYPE_Pensioner NAME_INCOME_TYPE_State servant NAME_INCOME_TYPE_Student NAME_INCOME_TYPE_Unemployed NAME_INCOME_TYPE_Working NAME_EDUCATION_TYPE_Higher education NAME_EDUCATION_TYPE_Incomplete higher NAME_EDUCATION_TYPE_Lower secondary NAME_EDUCATION_TYPE_Secondary / secondary special NAME_FAMILY_STATUS_Married NAME_FAMILY_STATUS_Separated NAME_FAMILY_STATUS_Single / not married NAME_FAMILY_STATUS_Widow NAME_HOUSING_TYPE_House / apartment NAME_HOUSING_TYPE_Municipal apartment NAME_HOUSING_TYPE_Office apartment NAME_HOUSING_TYPE_Rented apartment NAME_HOUSING_TYPE_With parents OCCUPATION_TYPE_Cleaning staff OCCUPATION_TYPE_Cooking staff OCCUPATION_TYPE_Core staff OCCUPATION_TYPE_Drivers OCCUPATION_TYPE_HR staff OCCUPATION_TYPE_High skill tech staff OCCUPATION_TYPE_IT staff OCCUPATION_TYPE_Laborers OCCUPATION_TYPE_Low-skill Laborers OCCUPATION_TYPE_Managers OCCUPATION_TYPE_Medicine staff OCCUPATION_TYPE_Private service staff OCCUPATION_TYPE_Realty agents OCCUPATION_TYPE_Sales staff OCCUPATION_TYPE_Secretaries OCCUPATION_TYPE_Security staff OCCUPATION_TYPE_Waiters/barmen staff WEEKDAY_APPR_PROCESS_START_MONDAY WEEKDAY_APPR_PROCESS_START_SATURDAY WEEKDAY_APPR_PROCESS_START_SUNDAY WEEKDAY_APPR_PROCESS_START_THURSDAY WEEKDAY_APPR_PROCESS_START_TUESDAY WEEKDAY_APPR_PROCESS_START_WEDNESDAY ORGANIZATION_TYPE_Agriculture ORGANIZATION_TYPE_Bank ORGANIZATION_TYPE_Business Entity Type 1 ORGANIZATION_TYPE_Business Entity Type 2 ORGANIZATION_TYPE_Business Entity Type 3 ORGANIZATION_TYPE_Cleaning ORGANIZATION_TYPE_Construction ORGANIZATION_TYPE_Culture ORGANIZATION_TYPE_Electricity ORGANIZATION_TYPE_Emergency ORGANIZATION_TYPE_Government ORGANIZATION_TYPE_Hotel ORGANIZATION_TYPE_Housing ORGANIZATION_TYPE_Industry: type 1 ORGANIZATION_TYPE_Industry: type 10 ORGANIZATION_TYPE_Industry: type 11 ORGANIZATION_TYPE_Industry: type 12 ORGANIZATION_TYPE_Industry: type 13 ORGANIZATION_TYPE_Industry: type 2 ORGANIZATION_TYPE_Industry: type 3 ORGANIZATION_TYPE_Industry: type 4 ORGANIZATION_TYPE_Industry: type 5 ORGANIZATION_TYPE_Industry: type 6 ORGANIZATION_TYPE_Industry: type 7 ORGANIZATION_TYPE_Industry: type 8 ORGANIZATION_TYPE_Industry: type 9 ORGANIZATION_TYPE_Insurance ORGANIZATION_TYPE_Kindergarten ORGANIZATION_TYPE_Legal Services ORGANIZATION_TYPE_Medicine ORGANIZATION_TYPE_Military ORGANIZATION_TYPE_Mobile ORGANIZATION_TYPE_Other ORGANIZATION_TYPE_Police ORGANIZATION_TYPE_Postal ORGANIZATION_TYPE_Realtor ORGANIZATION_TYPE_Religion ORGANIZATION_TYPE_Restaurant ORGANIZATION_TYPE_School ORGANIZATION_TYPE_Security ORGANIZATION_TYPE_Security Ministries ORGANIZATION_TYPE_Self-employed ORGANIZATION_TYPE_Services ORGANIZATION_TYPE_Telecom ORGANIZATION_TYPE_Trade: type 1 ORGANIZATION_TYPE_Trade: type 2 ORGANIZATION_TYPE_Trade: type 3 ORGANIZATION_TYPE_Trade: type 4 ORGANIZATION_TYPE_Trade: type 5 ORGANIZATION_TYPE_Trade: type 6 ORGANIZATION_TYPE_Trade: type 7 ORGANIZATION_TYPE_Transport: type 1 ORGANIZATION_TYPE_Transport: type 2 ORGANIZATION_TYPE_Transport: type 3 ORGANIZATION_TYPE_Transport: type 4 ORGANIZATION_TYPE_University ORGANIZATION_TYPE_XNA HOUSETYPE_MODE_specific housing HOUSETYPE_MODE_terraced house WALLSMATERIAL_MODE_Mixed WALLSMATERIAL_MODE_Monolithic WALLSMATERIAL_MODE_Others WALLSMATERIAL_MODE_Panel WALLSMATERIAL_MODE_Stone, brick WALLSMATERIAL_MODE_Wooden EMERGENCYSTATE_MODE_Yes
0 0.0 20560.5 450000.0 0.018850 -2329.0 -5170.0 -812.0 0.0 1.0 0.0 1.0 0.0 1.0 2.0 2.0 18.0 0.0 0.0 0.0 0.0 0.0 0.0 0.752614 0.789654 0.159520 0.0590 0.9732 0.00 0.1379 0.1250 0.0487 0.0030 0.0392 0.0 0.0 0.0 -1740.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.428571 0.571429 240043.666667 -93.142857 0.0 0.0 207623.571429 1453365.00 85240.928571 596686.5 0.0 0.000000 0.00 10822.50 3545.357143 0.0 -51.0 0.0 24.571429 172.0 -1.0 0.0 -0.543363 0.282518 817.428571 5722.0 197.000000 788.0 0.0 0.0 1.000000 7.0 0.000000 0.0 0.0 0.0 0.0 0.0 1.0 -0.019908 0.0 0.111111 0.111111 0.0 0.00 0.0 0.000000 0.0 0.0 0.0 0.00 0.0 0.0 0.0 0.00 0.000 0.0 0.00 0.0 0.0 0.000 0.000000 0.0 0.0 0.000 0.00 0.000 0.0 0.000000 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 1.0 1.0 3951.000 2520.0 2520.0 24835.5 24835.5 8.000000 8.0 8.0 8.0 1.0 0.0 0.0 0.0 0.0 5885.132143 41195.925 -7.285714 -51.0 0.000000 0.000 1.0 7.0 0.0 0.142857 0.000000 0.152300 1.2640 3.333333 118800.0 -57.964792 -7.016267 2.0 27.664697 0.235267 1.4116 0.223607 20.0 3.0 0.516398 -1.875000 0.0 4.0 135000.0 0.567263 0.0 1.537952 1.0 0.0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
1 0.0 17370.0 180000.0 0.035792 -4469.0 -9118.0 -1623.0 0.0 1.0 0.0 1.0 0.0 0.0 2.0 2.0 9.0 0.0 0.0 0.0 0.0 0.0 0.0 0.564990 0.291656 0.432962 0.0759 0.9816 0.00 0.1379 0.1667 0.0487 0.0030 0.0688 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 3.0 2.0 0.666667 0.333333 26340.333333 -54.333333 0.0 0.0 219042.000000 657126.00 189469.500000 568408.5 0.0 0.000000 0.00 4261.50 1420.500000 0.0 -12.0 0.0 7.000000 21.0 -1.0 0.0 -0.416667 0.601256 630.000000 1890.0 -5.000000 -5.0 0.0 0.0 0.666667 2.0 0.333333 1.0 0.0 0.0 0.0 0.0 1.0 -0.327273 0.0 0.000000 0.000000 0.0 0.00 0.0 0.000000 0.0 0.0 0.0 0.00 0.0 0.0 0.0 0.00 0.000 0.0 0.00 0.0 0.0 0.000 0.000000 0.0 0.0 0.000 0.00 0.000 0.0 0.000000 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 2.0 2.0 4813.200 4464.0 4464.0 44617.5 44617.5 12.000000 12.0 12.0 12.0 1.0 1.0 0.0 0.0 0.0 6240.205000 56161.845 -23.555556 -212.0 0.000000 0.000 1.0 9.0 0.0 0.111111 0.000000 0.175455 1.2376 1.818182 42768.0 -22.152607 -5.480514 2.0 12.824870 0.207375 0.0000 0.223607 20.0 3.0 0.547723 -3.333333 3.0 3.0 157500.0 0.429869 0.0 2.212545 1.0 0.0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
2 0.0 69777.0 630000.0 0.019101 -4458.0 -2175.0 -3503.0 5.0 1.0 0.0 1.0 0.0 0.0 2.0 2.0 14.0 0.0 0.0 0.0 0.0 0.0 0.0 0.505143 0.699787 0.610991 0.0759 0.9816 0.00 0.1379 0.1667 0.0487 0.0030 0.0688 0.0 0.0 0.0 -856.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 4.0 2.0 0.000000 1.000000 155208.333333 -775.500000 0.0 19305.0 518070.015000 2072280.06 0.000000 0.0 0.0 0.000000 0.00 0.00 0.000000 0.0 -68.0 0.0 57.500000 230.0 -1.0 0.0 -0.736842 0.000000 669.500000 2678.0 -13.250000 -53.0 0.5 2.0 0.500000 2.0 0.000000 0.0 0.0 0.0 0.0 0.0 1.0 -0.517857 0.0 0.055556 0.000000 12645000.0 131718.75 571500.0 6350.000000 0.0 157500.0 0.0 157500.00 0.0 0.0 0.0 0.00 0.000 0.0 0.00 0.0 0.0 654448.545 6817.172344 0.0 153675.0 1737703.665 -274.32 161420.220 23.0 0.255556 0.0 0.0 0.0 0.000000 0.0 1.0 1.0 1.0 96.0 1.0 22.0 1.041667 1.024890 4.0 2.0 11478.195 3375.0 6750.0 174495.0 523485.0 17.333333 52.0 6.0 36.0 3.0 1.0 0.0 0.0 0.0 9740.235774 1509736.545 -5.180645 -803.0 1157.662742 179437.725 11.0 135.0 20.0 0.070968 0.129032 0.344578 1.0528 3.111111 33264.0 -45.423957 -10.105799 2.0 9.505482 0.207375 0.0000 0.223607 20.0 3.0 0.547723 -3.333333 5.0 4.0 180000.0 0.655389 0.0 2.558370 0.0 0.0 0 1 1 1 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0
3 2.0 49018.5 1575000.0 0.026392 -1866.0 -2000.0 -4208.0 0.0 1.0 0.0 1.0 1.0 0.0 4.0 2.0 11.0 0.0 0.0 0.0 0.0 0.0 0.0 0.525734 0.509677 0.612704 0.1974 0.9970 0.32 0.2759 0.3750 0.2078 0.0817 0.3700 0.0 0.0 0.0 -1805.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 3.0 2.0 0.416667 0.583333 430355.840909 -651.500000 0.0 0.0 126739.590000 1520875.08 18630.450000 186304.5 0.0 14484.394286 101390.76 12897.09 3012.010714 0.0 -69.0 0.0 46.666667 560.0 -1.0 0.0 -0.538289 0.122267 3932.700000 39327.0 61.166667 367.0 0.0 0.0 0.583333 7.0 0.416667 5.0 0.0 0.0 0.0 0.0 1.0 -0.241353 0.0 0.000000 0.000000 11025000.0 225000.00 27000.0 613.636364 0.0 18000.0 0.0 22823.55 0.0 0.0 0.0 274663.62 6242.355 0.0 22823.55 0.0 0.0 274701.465 5606.152347 0.0 15750.0 390461.850 0.00 36980.415 2.0 0.045455 0.0 0.0 115.0 2.613636 12.0 1.0 0.0 0.0 49.0 1.0 35.0 2.040816 0.165937 5.0 3.0 8091.585 3750.0 11250.0 82012.5 246037.5 11.333333 34.0 0.0 24.0 3.0 1.0 0.0 1.0 0.0 4356.731549 492310.665 -3.000000 -339.0 622.550708 70348.230 12.0 93.0 20.0 0.106195 0.176991 0.155614 1.0000 5.000000 0.0 -168.810289 -22.538638 2.0 32.130726 0.322743 4.5184 0.223607 20.0 3.0 0.516398 -1.875000 3.0 2.0 180000.0 0.549372 0.0 0.402348 0.0 0.0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
4 1.0 32067.0 625500.0 0.010032 -2191.0 -4000.0 -4262.0 16.0 1.0 1.0 1.0 0.0 0.0 3.0 2.0 5.0 0.0 0.0 0.0 0.0 1.0 1.0 0.202145 0.425687 0.537070 0.0759 0.9816 0.00 0.1379 0.1667 0.0487 0.0030 0.0688 0.0 0.0 0.0 -821.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.000000 0.00 0.000000 0.0 0.0 0.000000 0.00 0.00 0.000000 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.000000 0.000000 0.000000 0.0 0.000000 0.0 0.0 0.0 0.000000 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 1.0 -0.278388 0.0 0.000000 0.000000 0.0 0.00 0.0 0.000000 0.0 0.0 0.0 0.00 0.0 0.0 0.0 0.00 0.000 0.0 0.00 0.0 0.0 0.000 0.000000 0.0 0.0 0.000 0.00 0.000 0.0 0.000000 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 2.0 2.0 17782.155 8095.5 8095.5 267727.5 535455.0 24.000000 48.0 12.0 36.0 1.0 1.0 0.0 0.0 0.0 11100.337500 133204.050 -12.250000 -147.0 0.000000 0.000 0.0 12.0 0.0 0.000000 0.000000 0.178150 1.0000 3.475000 0.0 -82.154267 -13.803681 2.0 19.506034 0.207375 0.0000 0.223607 20.0 3.0 0.516398 -1.875000 0.0 2.0 180000.0 0.313916 1.0 0.000000 1.0 0.0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0

Pre-processed file is large in datasize and it may slow down the process. Below function reduces memory of this dataframe. As we can see, memory usage for training set was 365MB. After reduction it is 88MB. Similarly for test dataset memmory usage is reduced from 115MB to 28MB.

In [7]:
def reduce_memory(df):
    start_mem = df.memory_usage().sum() / 1024**2
    print('memory usage is ' , round(start_mem), 'MB')
    for col in df.columns:
        col_type = df[col].dtype

        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2
    print('end memory usage is ' , round(end_mem), 'MB')
    return df
In [8]:
df_train = reduce_memory(df_train)
memory usage is  365.0 MB
end memory usage is  88.0 MB
In [9]:
df_test = reduce_memory(df_test)
memory usage is  115.0 MB
end memory usage is  28.0 MB

Feature Importance

During pre-processing steps we have reduced dimension of dataset. However, it is possible to reduce it further. We can do this by performing feature importance and remove all feature which has zero importance.

In order to be more conservative, two models XGBOOST and LightGBM were fitted and removed only features which has zero importance in both models. Also for each model, two fold validation performed. This way we get very robust values for feature importance and we will drop features that has zero importance.

Best performing features are

  • Term of Loan (Engineered Feature)

  • Mean of External Sources (Engineered Feature)

  • Length of Employment (Given)

  • External Source Score (Given)

  • Future monthly installment ratio (Engoneered)

  • Anuuity Amount

Several features that looks important at first sight were given zero importance by model in order to predict outcome.

Some of the features that were given zero importance by models are

  • Industry where prople work

  • Requirement of certain document

  • Whether person is unemployed or student, as long as he has income

  • Whether person is first time applicant or not, it does not matter.

In [11]:
X = df_train.drop('TARGET', axis = 1).values
y = df_train['TARGET'].values

X_train, X_test, y_train, y_test = train_test_split(X, y , stratify = y, 
                                                    test_size = 0.3, random_state = 42)
In [12]:
# Initialize an empty array to hold feature importances

## WE WILL DOUBLE FIT IN ORDER TO BE MORE CONSERVATIVE

models = [LGBMClassifier(objective='binary', boosting_type = 'goss', n_estimators = 10000, class_weight = 'balanced'),
          XGBClassifier(objective='binary:logistic', n_estimators = 10000,)]

feature_importances = np.zeros(X_train.shape[1])

# Create the model with several hyperparameters

for model in models:
    for i in range(2):

        # Split into training and validation set
        train_features, valid_features, train_y, valid_y = train_test_split(X_train, y_train, test_size = 0.25, random_state = i*42)

        # Train using early stopping
        model.fit(train_features, train_y, early_stopping_rounds=100, eval_set = [(valid_features, valid_y)], 
                  eval_metric = 'auc', verbose = 200)

        # Record the feature importances
        feature_importances += model.feature_importances_
Training until validation scores don't improve for 100 rounds
[200]	valid_0's auc: 0.766996	valid_0's binary_logloss: 0.456285
Early stopping, best iteration is:
[141]	valid_0's auc: 0.768655	valid_0's binary_logloss: 0.482494
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[67]	valid_0's auc: 0.775001	valid_0's binary_logloss: 0.528651
[0]	validation_0-auc:0.70717
Will train until validation_0-auc hasn't improved in 100 rounds.
Stopping. Best iteration:
[33]	validation_0-auc:0.75892

[0]	validation_0-auc:0.72096
Will train until validation_0-auc hasn't improved in 100 rounds.
Stopping. Best iteration:
[50]	validation_0-auc:0.76620

In [13]:
# Make sure to average feature importances! 
feature_importances = feature_importances / 4
feature_importances = pd.DataFrame({'feature': list(df_train.drop('TARGET', axis=1).columns), 'importance': feature_importances}).sort_values('importance', ascending = False)

display(feature_importances.head())

# Find the features with zero importance
zero_features = list(feature_importances[feature_importances['importance'] == 0.0]['feature'])
print('There are %d features with 0.0 importance' % len(zero_features))
display(feature_importances.tail(20))
feature importance
187 NEW_EXT_SOURCES_MEAN 57.513608
176 TERM 56.753153
99 POS_Fut_Install_Mon_Ratio 33.002396
24 EXT_SOURCE_3 32.002496
23 EXT_SOURCE_2 30.002074
There are 51 features with 0.0 importance
feature importance
48 FLAG_DOCUMENT_17 0.0
255 ORGANIZATION_TYPE_Hotel 0.0
256 ORGANIZATION_TYPE_Housing 0.0
210 NAME_EDUCATION_TYPE_Lower secondary 0.0
258 ORGANIZATION_TYPE_Industry: type 10 0.0
260 ORGANIZATION_TYPE_Industry: type 12 0.0
198 NAME_TYPE_SUITE_Other_A 0.0
261 ORGANIZATION_TYPE_Industry: type 13 0.0
262 ORGANIZATION_TYPE_Industry: type 2 0.0
264 ORGANIZATION_TYPE_Industry: type 4 0.0
265 ORGANIZATION_TYPE_Industry: type 5 0.0
266 ORGANIZATION_TYPE_Industry: type 6 0.0
267 ORGANIZATION_TYPE_Industry: type 7 0.0
268 ORGANIZATION_TYPE_Industry: type 8 0.0
50 FLAG_DOCUMENT_19 0.0
270 ORGANIZATION_TYPE_Insurance 0.0
206 NAME_INCOME_TYPE_Unemployed 0.0
235 OCCUPATION_TYPE_Secretaries 0.0
205 NAME_INCOME_TYPE_Student 0.0
38 FLAG_DOCUMENT_5 0.0
In [14]:
def plot_feature_importances(df, threshold = 0.9):
    """
    Plots 15 most important features and the cumulative importance of features.
    Prints the number of features needed to reach threshold cumulative importance.
    
    Parameters
    --------
    df : dataframe
        Dataframe of feature importances. Columns must be feature and importance
    threshold : float, default = 0.9
        Threshold for prining information about cumulative importances
        
    Return
    --------
    df : dataframe
        Dataframe ordered by feature importances with a normalized column (sums to 1)
        and a cumulative importance column
    
    """
    
    plt.rcParams['font.size'] = 18
    
    # Sort features according to importance
    df = df.sort_values('importance', ascending = False).reset_index()
    
    # Normalize the feature importances to add up to one
    df['importance_normalized'] = df['importance'] / df['importance'].sum()
    df['cumulative_importance'] = np.cumsum(df['importance_normalized'])

    # Make a horizontal bar chart of feature importances
    plt.figure(figsize = (10, 6))
    ax = plt.subplot()
    
    # Need to reverse the index to plot most important on top
    ax.barh(list(reversed(list(df.index[:15]))), 
            df['importance_normalized'].head(15), 
            align = 'center', edgecolor = 'k')
    
    # Set the yticks and labels
    ax.set_yticks(list(reversed(list(df.index[:15]))))
    ax.set_yticklabels(df['feature'].head(15))
    
    # Plot labeling
    plt.xlabel('Normalized Importance'); plt.title('Feature Importances')
    plt.show()
    
    # Cumulative importance plot
    plt.figure(figsize = (8, 6))
    plt.plot(list(range(len(df))), df['cumulative_importance'], 'r-')
    plt.xlabel('Number of Features'); plt.ylabel('Cumulative Importance'); 
    plt.title('Cumulative Feature Importance');
    plt.show();
    
    importance_index = np.min(np.where(df['cumulative_importance'] > threshold))
    print('%d features required for %0.2f of cumulative importance' % (importance_index + 1, threshold))
    
    return df
In [15]:
norm_feature_importances = plot_feature_importances(feature_importances)
108 features required for 0.90 of cumulative importance

Drop columns that has zero improtance in XGB and LGBM

In [18]:
df_train_reduced = df_train.drop(zero_features, axis=1)

X = df_train_reduced.drop('TARGET', axis = 1).values
y = df_train_reduced['TARGET'].values

X_train, X_test, y_train, y_test = train_test_split(X, y , stratify = y, 
                                                    test_size = 0.3, random_state = 42)
In [19]:
X_train.shape
Out[19]:
(107628, 259)

Models and Evaluation Metric

We will evalaute model using grid search with several hyper parameters. It is also observed that during GRID Search XGBoost performs slower and takes more time. No. of experiments are adjusted for both models based on time constraint.

Models and Theory

The tree ensemble model consists of a set of classification and regression trees (CART). Usually, a single tree is not strong enough to be used in practice. What is actually used is the ensemble model, which sums the prediction of multiple trees together.

The prediction scores of each individual tree are summed up to get the final score.

Mathematically, we can write our model in the form

\begin{equation*} \hat y_i = \left( \sum_{k=1}^K f_k(x_i) \right) , f_k \in F \end{equation*}

Where $K$ is number of trees, $f$ is function in functional space $F$, and $F$ is the set of all possible CARTs.

Objective or loss funtion is

\begin{equation*} obj(θ)=∑l(y_i, \hat y_i)+ \left(\sum_{k=1}^K Ω(fk) \right) \end{equation*}

XGBoost and LightGBM are the packages belong to the family of gradient boosting decision trees (GBDTs).

  • XGBOOST

XGBoost is an implementation of gradient boosted decision trees designed for speed and performance. XGBoost stands for Extremen Gradient Boosting. It uses a library of gradient boosting decsion tree algorithm. Boosting is an ensemble technique where new models are added to correct the errors made by existing models. Models are added sequentially until no further improvements can be made.

XGBoost is also known as the regularised version of GBM. This framework includes built-in L1 and L2 regularisation which means it can prevent a model from overfitting.

  • Light GBM

Light Gradient Boosting or LightGBM is a highly efficient gradient boosting decision tree algorithm. It is similar to XGBoost and varies when it comes to the method of creating trees. This algorithm constructs trees leaf-wise in a best-first order due to which there is a tendency to achieve lower loss.

Evaluation Metric

ACCURACY SCORE

Accuracy is one metric for evaluating classification models. Informally, accuracy is the fraction of predictions our model got right. Formally, accuracy has the following definition:

\begin{equation*} Accuracy = \frac{Number\; of\; Correct\; Prediction}{Total\; Number\; of\; Predictions} \end{equation*}

ROC and AUC Score

An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. This curve plots two parameters:

  • True Positive Rate

  • False Positive Rate

True Positive Rate (TPR) is a synonym for recall and is therefore defined as follows:

\begin{equation*} TPR = \frac{TP}{TP + FN} \end{equation*}

Flase Positive Rate (FPR) is defined as follows:

\begin{equation*} FPR = \frac{FP}{FP + TN} \end{equation*}

AUC stands for "Area under the ROC Curve."

image.png

After some theory, we will develop functions to perform gridsearch. Although it is possible to develop a one function that can handle all of case studies, as a team we divided task of gridsearch among us and we have developed several function to perform gridsearch

Unbalanced and Manual Balanced Dataset

In [20]:
# A Function to execute the grid search and record the results.
def ConductGridSearch(X_train, y_train, X_test, y_test, i=0, prefix='', n_jobs=-1,verbose=1):
    
    # Create Exp Logbook
    explog = pd.DataFrame(columns= ['Model', 'Train Accuracy Score', 'Train AUC Score', 'Test Accuracy Score',
                                'Test AUC Score','Training Time', 'Experiments', 'Description'])
    X_train, X_val, y_train, y_val = train_test_split(X_train, y_train , stratify = y_train, 
                                                    test_size = 0.2, random_state = 42)
    
    # Create a list of classifiers for our grid search experiment
    classifiers = [
        ('XGBoost', XGBClassifier(random_state = 42)),
        ('LightGBM', LGBMClassifier(random_state=42)),
    ]

    # Arrange grid search parameters for each classifier

    params_grid = {
        
        'XGBoost':  {
            'n_estimators' : [500, 1000], 
            'eta': [0.01, 0.05, 0.1],
            'min_child_weight' : [20, 35],
            'colsample_bytree': [0.8, 0.95],
#             'reg_alpha'  : [0.01, 0.05, 0.1],
#             'reg_lambda' : [0.01, 0.05, 0.1],
            'max_depth' : [3,6,8],
            'verbosity': [0],
            'verbose_eval': [False],
        },
             
        'LightGBM':  {
            'n_estimators' : [5000, 10000],  
            'boosting_type': ['gbdt'],
            'num_leaves': [30, 35],
            'learning_rate': [0.01, 0.05, 0.1],
            'colsample_bytree': [0.8, 0.95 ],
            'is_unbalance': [True],
            'reg_alpha'  : [0.01, 0.02, 0.05],
            'reg_lambda' : [0.01, 0.02, 0.05],
            'min_split_gain' :[0.01, 0.02, 0.05],
        },
    }
    
    fit_parms_grid = {

        'XGBoost':  {
            "early_stopping_rounds":50, 
            "eval_metric" : 'auc', 
            "eval_set" : [(X_val,y_val)],
            
            
        },


        'LightGBM':  {
            "early_stopping_rounds":50, 
            "eval_metric" : 'auc', 
            "eval_set" : [(X_val,y_val)],
            'eval_names': ['valid'],
            'verbose': -1,
        }
    }    
    
    
    
    
    for (name, classifier) in classifiers:
        i += 1
        # Print classifier and parameters
        print('****** START',prefix, name,'*****')
        parameters = params_grid[name]
        fit_parameter = fit_parms_grid[name]
        print("Parameters:")
        for p in sorted(parameters.keys()):
            print("\t"+str(p)+": "+ str(parameters[p]))
            
        
        total_exp = len(ParameterGrid(parameters))*5
        
        # generate the pipeline
        full_pipeline_with_predictor = Pipeline([
#         ("preparation", data_pipeline),
        ("predictor", classifier)
        ])
        
        # Execute the grid search
        params = {}
        for p in parameters.keys():
            pipe_key = 'predictor__'+str(p)
            params[pipe_key] = parameters[p] 
        
        fit_params = {}
        for f in fit_parameter.keys():
            fit_key = 'predictor__'+str(f)
            fit_params[fit_key] = fit_parameter[f] 
            
            
        grid_search = GridSearchCV(full_pipeline_with_predictor, params, scoring='roc_auc', cv=5, 
                                   n_jobs=n_jobs, verbose=verbose)
        grid_search.fit(X_train, y_train, **fit_params)
                
        # # Best estimator score
        # best_train = pct(grid_search.best_score_)

        # Best estimator fitting time
        
        start = time()
        grid_search.best_estimator_.fit(X_train, y_train)
        Train_time = round(time() - start, 4)

        # Best estimator prediction time
#         y_pred_train = grid_search.best_estimator_.predict(X_train)
#         y_pred = grid_search.best_estimator_.predict(X_test)
        
        accuracy_train = round(accuracy_score(y_train,grid_search.best_estimator_.predict(X_train)), 3)
        roc_train = round(roc_auc_score(y_train,  grid_search.best_estimator_.predict_proba(X_train)[:, 1]), 3)
                
        accuracy_test = round(accuracy_score(y_test,grid_search.best_estimator_.predict(X_test)), 3)
        roc_test = round(roc_auc_score(y_test, grid_search.best_estimator_.predict_proba(X_test)[:, 1]), 3)

        print('*'*40)
        print('\n')

        title = name + " - Normalized Confusion Matrix"
        disp = plot_confusion_matrix(grid_search.best_estimator_,X_test, y_test, normalize='true')
        disp.ax_.set_title(title)
        plt.show()

        print('-'*40)
        print('\n')

        disp2 = plot_roc_curve(grid_search.best_estimator_, X_test, y_test, )
        disp2.ax_.set_title ("ROC curve - " + name)
        plt.show()

        print('*'*40)
        print('\n')

        # Collect the best parameters found by the grid search
        print("Best Parameters:")
        best_parameters = grid_search.best_estimator_.get_params()
        param_dump = []
        for param_name in sorted(params.keys()):
            param_dump.append((param_name, best_parameters[param_name]))
            print("\t"+str(param_name)+": " + str(best_parameters[param_name]))
        print("****** FINISH",prefix,name," *****")
        print("")
        print("-"*40)
        print("*"*40)
        print("-"*40)
        # Record the results
        explog.loc[len(explog)] = [prefix+name, accuracy_train, roc_train,
                                   accuracy_test, roc_test, Train_time, total_exp,
                                   json.dumps(param_dump)]
    sttime = datetime.now().strftime('%Y%m%d_%H:%M:%S - ') 
    display(explog)
    explog.to_csv(sttime + 'experiment_log.csv', index = False)

Function to run gridsearch for SMOTE Balanced

In [21]:
# A Function to execute the grid search and record the results.
def SMOTE_ConductGridSearch(X_train, y_train, X_test, y_test, i=0, prefix='', n_jobs=-1,verbose=1):
    
    # Create Exp Logbook
    explog = pd.DataFrame(columns= ['Model', 'Train Accuracy Score', 'Train AUC Score', 'Test Accuracy Score',
                                'Test AUC Score','Training Time', 'Experiments', 'Description'])
    X_train, X_val, y_train, y_val = train_test_split(X_train, y_train , stratify = y_train, 
                                                    test_size = 0.2, random_state = 42)
    
    # Create a list of classifiers for our grid search experiment
    classifiers = [
        ('XGBoost', XGBClassifier(random_state = 42)),
        ('LightGBM', LGBMClassifier(random_state=42)),
    ]

    # Arrange grid search parameters for each classifier

    params_grid = {
        
        'XGBoost':  {
            'n_estimators' : [500, 1000], 
            'eta': [0.01, 0.05, 0.1],
            'min_child_weight' : [20, 35],
            'colsample_bytree': [0.8],
            'reg_alpha'  : [0.01],
            'reg_lambda' : [0.05],
            'max_depth' : [6,8],
            'verbosity': [0],
            'verbose_eval': [False],
        },
             
        'LightGBM':  {
            'n_estimators' : [5000, 10000],  
            'boosting_type': ['gbdt'],
            'num_leaves': [30, 35],
            'learning_rate': [0.01, 0.05, 0.1],
            'colsample_bytree': [0.8, 0.95 ],
            'is_unbalance': [True],
            'reg_alpha'  : [0.01, 0.02, 0.05],
            'reg_lambda' : [0.01, 0.02, 0.05],
            'min_split_gain' :[0.01, 0.02, 0.05],
        },
    }
    
    fit_parms_grid = {

        'XGBoost':  {
            "early_stopping_rounds":50, 
            "eval_metric" : 'auc', 
            "eval_set" : [(X_val,y_val)],
            
            
        },


        'LightGBM':  {
            "early_stopping_rounds":50, 
            "eval_metric" : 'auc', 
            "eval_set" : [(X_val,y_val)],
            'eval_names': ['valid'],
            'verbose': -1,
        }
    }    
    
    
    
    
    for (name, classifier) in classifiers:
        i += 1
        # Print classifier and parameters
        print('****** START',prefix, name,'*****')
        parameters = params_grid[name]
        fit_parameter = fit_parms_grid[name]

            
        
        total_exp = len(ParameterGrid(parameters))*5
        
        # generate the pipeline
        full_pipeline_with_predictor = imbpipe([
        ('sampling', SMOTE(random_state=42)),
        ("predictor", classifier)
        ])

        # Execute the grid search
        params = {'sampling__sampling_strategy' : [0.25,0.5,'auto']}
        for p in parameters.keys():
            pipe_key = 'predictor__'+str(p)
            params[pipe_key] = parameters[p] 
        
        print("Parameters:")
        for p in sorted(params.keys()):
            print("\t"+str(p)+": "+ str(params[p]))
        
        fit_params = {}
        for f in fit_parameter.keys():
            fit_key = 'predictor__'+str(f)
            fit_params[fit_key] = fit_parameter[f] 
            
            
        grid_search = GridSearchCV(full_pipeline_with_predictor, params, scoring='roc_auc', cv=5, 
                                   n_jobs=n_jobs, verbose=verbose)
        grid_search.fit(X_train, y_train, **fit_params)
                
        # # Best estimator score
        # best_train = pct(grid_search.best_score_)

        # Best estimator fitting time
        
        start = time()
        grid_search.best_estimator_.fit(X_train, y_train)
        Train_time = round(time() - start, 4)

        # Best estimator prediction time
#         y_pred_train = grid_search.best_estimator_.predict(X_train)
#         y_pred = grid_search.best_estimator_.predict(X_test)
        
        accuracy_train = round(accuracy_score(y_train,grid_search.best_estimator_.predict(X_train)), 3)
        roc_train = round(roc_auc_score(y_train,  grid_search.best_estimator_.predict_proba(X_train)[:, 1]), 3)
                
        accuracy_test = round(accuracy_score(y_test,grid_search.best_estimator_.predict(X_test)), 3)
        roc_test = round(roc_auc_score(y_test, grid_search.best_estimator_.predict_proba(X_test)[:, 1]), 3)

        print('*'*40)
        print('\n')

        title = name + " - Normalized Confusion Matrix"
        disp = plot_confusion_matrix(grid_search.best_estimator_,X_test, y_test, normalize='true')
        disp.ax_.set_title(title)
        plt.show()

        print('-'*40)
        print('\n')

        disp2 = plot_roc_curve(grid_search.best_estimator_, X_test, y_test, )
        disp2.ax_.set_title ("ROC curve - " + name)
        plt.show()

        print('*'*40)
        print('\n')

        # Collect the best parameters found by the grid search
        print("Best Parameters:")
        best_parameters = grid_search.best_estimator_.get_params()
        param_dump = []
        for param_name in sorted(params.keys()):
            param_dump.append((param_name, best_parameters[param_name]))
            print("\t"+str(param_name)+": " + str(best_parameters[param_name]))
        print("****** FINISH",prefix,name," *****")
        print("")
        print("-"*40)
        print("*"*40)
        print("-"*40)
        # Record the results
        explog.loc[len(explog)] = [prefix+name, accuracy_train, roc_train,
                                   accuracy_test, roc_test, Train_time, total_exp,
                                   json.dumps(param_dump)]
    sttime = datetime.now().strftime('%Y%m%d_%H:%M:%S - ') 
    display(explog)
    explog.to_csv(sttime + 'experiment_log.csv', index = False)

Function to run gridsearch With PCA (For unbalanced Set)

In [53]:
# A Function to execute the grid search and record the results.
def PCA_ConductGridSearch(X_train, y_train, X_test, y_test, i=0, prefix='', n_jobs=-1,verbose=1):
    
    # Create Exp Logbook
    explog = pd.DataFrame(columns= ['Model', 'Train Accuracy Score', 'Train AUC Score', 'Test Accuracy Score',
                                'Test AUC Score','Training Time', 'Experiments', 'Description'])
    X_train, X_val, y_train, y_val = train_test_split(X_train, y_train , stratify = y_train, 
                                                    test_size = 0.2, random_state = 42)
    
    # Create a list of classifiers for our grid search experiment
    classifiers = [
        ('XGBoost', XGBClassifier(random_state = 42)),
        ('LightGBM', LGBMClassifier(random_state=42)),
    ]

    # Arrange grid search parameters for each classifier

    params_grid = {
        
        'XGBoost':  {
            'n_estimators' : [500, 1000], 
            'eta': [0.01, 0.05, 0.1],
            'min_child_weight' : [20, 35],
            'colsample_bytree': [0.8, 0.95],
#             'reg_alpha'  : [0.01, 0.05, 0.1],
#             'reg_lambda' : [0.01, 0.05, 0.1],
            'max_depth' : [3,6,8],
            'verbosity': [0],
            'verbose_eval': [False],
        },
             
        'LightGBM':  {
            'n_estimators' : [5000, 10000],  
            'boosting_type': ['gbdt'],
            'num_leaves': [30, 35],
            'learning_rate': [0.01, 0.05, 0.1],
            'colsample_bytree': [0.8, 0.95 ],
            'is_unbalance': [True],
            'reg_alpha'  : [0.01, 0.02, 0.05],
            'reg_lambda' : [0.01, 0.02, 0.05],
            'min_split_gain' :[0.01, 0.02, 0.05],
        },
    }
    
    fit_parms_grid = {

        'XGBoost':  {
            "early_stopping_rounds":50, 
            "eval_metric" : 'auc', 
            "eval_set" : [(X_val,y_val)],
            
            
        },


        'LightGBM':  {
            "early_stopping_rounds":50, 
            "eval_metric" : 'auc', 
            "eval_set" : [(X_val,y_val)],
            'eval_names': ['valid'],
            'verbose': -1,
        }
    }    
    
    
    
    
    for (name, classifier) in classifiers:
        i += 1
        # Print classifier and parameters
        print('****** START',prefix, name,'*****')
        parameters = params_grid[name]
        fit_parameter = fit_parms_grid[name]

            
        
        total_exp = len(ParameterGrid(parameters))*5
        
        # generate the pipeline
        full_pipeline_with_predictor = Pipeline([
        ('pca', PCA()),
        ("predictor", classifier)
        ])

        # Execute the grid search
        params = {'pca__n_components' : [0.9,0.95,0.99]}
        for p in parameters.keys():
            pipe_key = 'predictor__'+str(p)
            params[pipe_key] = parameters[p] 
            
        print("Parameters:")
        for p in sorted(params.keys()):
            print("\t"+str(p)+": "+ str(params[p]))
        
        fit_params = {}
        for f in fit_parameter.keys():
            fit_key = 'predictor__'+str(f)
            fit_params[fit_key] = fit_parameter[f] 
            
            
        grid_search = GridSearchCV(full_pipeline_with_predictor, params, scoring='roc_auc', cv=5, 
                                   n_jobs=n_jobs, verbose=verbose)
        grid_search.fit(X_train, y_train, **fit_params)
                
        # # Best estimator score
        # best_train = pct(grid_search.best_score_)

        # Best estimator fitting time
        
        start = time()
        grid_search.best_estimator_.fit(X_train, y_train)
        Train_time = round(time() - start, 4)

        # Best estimator prediction time
#         y_pred_train = grid_search.best_estimator_.predict(X_train)
#         y_pred = grid_search.best_estimator_.predict(X_test)
        
        accuracy_train = round(accuracy_score(y_train,grid_search.best_estimator_.predict(X_train)), 3)
        roc_train = round(roc_auc_score(y_train,  grid_search.best_estimator_.predict_proba(X_train)[:, 1]), 3)
                
        accuracy_test = round(accuracy_score(y_test,grid_search.best_estimator_.predict(X_test)), 3)
        roc_test = round(roc_auc_score(y_test, grid_search.best_estimator_.predict_proba(X_test)[:, 1]), 3)

        print('*'*40)
        print('\n')

        title = name + " - Normalized Confusion Matrix"
        disp = plot_confusion_matrix(grid_search.best_estimator_,X_test, y_test, normalize='true')
        disp.ax_.set_title(title)
        plt.show()

        print('-'*40)
        print('\n')

        disp2 = plot_roc_curve(grid_search.best_estimator_, X_test, y_test, )
        disp2.ax_.set_title ("ROC curve - " + name)
        plt.show()

        print('*'*40)
        print('\n')

        # Collect the best parameters found by the grid search
        print("Best Parameters:")
        best_parameters = grid_search.best_estimator_.get_params()
        param_dump = []
        for param_name in sorted(params.keys()):
            param_dump.append((param_name, best_parameters[param_name]))
            print("\t"+str(param_name)+": " + str(best_parameters[param_name]))
        print("****** FINISH",prefix,name," *****")
        print("")
        print("-"*40)
        print("*"*40)
        print("-"*40)
        # Record the results
        explog.loc[len(explog)] = [prefix+name, accuracy_train, roc_train,
                                   accuracy_test, roc_test, Train_time, total_exp,
                                   json.dumps(param_dump)]
    sttime = datetime.now().strftime('%Y%m%d_%H:%M:%S - ') 
    display(explog)
    explog.to_csv(sttime + 'experiment_log.csv', index = False)

Function to run gridsearch With Scaling (For unbalanced Set)

In [ ]:
class Custom_scaling(TransformerMixin, BaseEstimator):
    '''A template for a custom transformer.'''

    def __init__(self):
        
        self.scale = StandardScaler()
        
#         self.ohe = OneHotEncoder(handle_unknown='ignore'))

        
        ## Column Lists
        self.cat_list = []
        self.dis_num_list = []
        self.num_list = []
        


    def fit(self, X, y=None):
        df = pd.DataFrame(X)
        self.cat_list, self.dis_num_list, self.num_list = self._feature_type_split(df)
        
        self.scale.fit(df[self.num_list])
        
        return self

    def transform(self, X):
        # transform X via code or additional methods
        df = pd.DataFrame(X)
        # Continous Numerical List
        df[self.num_list] = self.scale.transform(df[self.num_list])
        
        print("Scaling completed............")
        return df.values
    
    def _feature_type_split(self, X):
        cat_list = []
        dis_num_list = []
        num_list = []
        df = X.copy()
        # df= df.drop('SK_ID_CURR', axis =1)
        for i in df.columns.tolist():
            if i != 'TARGET':
                if df[i].dtype == 'object':
                    cat_list.append(i)
                elif df[i].nunique() < 25:
                    dis_num_list.append(i)
                else:
                    num_list.append(i)
        print('Columns split completed............')
        return cat_list, dis_num_list, num_list
In [98]:
# A Function to execute the grid search and record the results.
def Scaled_ConductGridSearch(X_train, y_train, X_test, y_test, i=0, prefix='', n_jobs=-1,verbose=1):
    
    # Create Exp Logbook
    explog = pd.DataFrame(columns= ['Model', 'Train Accuracy Score', 'Train AUC Score', 'Test Accuracy Score',
                                'Test AUC Score','Training Time', 'Experiments', 'Description'])
    X_train, X_val, y_train, y_val = train_test_split(X_train, y_train , stratify = y_train, 
                                                    test_size = 0.2, random_state = 42)
    
    # Create a list of classifiers for our grid search experiment
    classifiers = [
        ('XGBoost', XGBClassifier(random_state = 42)),
        ('LightGBM', LGBMClassifier(random_state=42)),
    ]

    # Arrange grid search parameters for each classifier

    params_grid = {
        
        'XGBoost':  {
            'n_estimators' : [500, 1000], 
            'eta': [0.01, 0.05, 0.1],
            'min_child_weight' : [20, 35],
            'colsample_bytree': [0.8, 0.95],
#             'reg_alpha'  : [0.01, 0.05, 0.1],
#             'reg_lambda' : [0.01, 0.05, 0.1],
            'max_depth' : [3,6,8],
            'verbosity': [0],
            'verbose_eval': [False],
        },
             
        'LightGBM':  {
            'n_estimators' : [5000, 10000],  
            'boosting_type': ['gbdt'],
            'num_leaves': [30, 35],
            'learning_rate': [0.01, 0.05, 0.1],
            'colsample_bytree': [0.8, 0.95 ],
            'is_unbalance': [True],
            'reg_alpha'  : [0.01, 0.02, 0.05],
            'reg_lambda' : [0.01, 0.02, 0.05],
            'min_split_gain' :[0.01, 0.02, 0.05],
        },
    }
    
    fit_parms_grid = {

        'XGBoost':  {
            "early_stopping_rounds":50, 
            "eval_metric" : 'auc', 
            "eval_set" : [(X_val,y_val)],
            
            
        },


        'LightGBM':  {
            "early_stopping_rounds":50, 
            "eval_metric" : 'auc', 
            "eval_set" : [(X_val,y_val)],
            'eval_names': ['valid'],
            'verbose': -1,
        }
    }    
    
    
    
    
    for (name, classifier) in classifiers:
        i += 1
        # Print classifier and parameters
        print('****** START',prefix, name,'*****')
        parameters = params_grid[name]
        fit_parameter = fit_parms_grid[name]

            
        
        total_exp = len(ParameterGrid(parameters))*5
        
        # generate the pipeline
        full_pipeline_with_predictor = Pipeline([
        ('scale', Custom_scaling()),
        ("predictor", classifier)
        ])

        # Execute the grid search
        params = {}
        for p in parameters.keys():
            pipe_key = 'predictor__'+str(p)
            params[pipe_key] = parameters[p] 
            
        print("Parameters:")
        for p in sorted(params.keys()):
            print("\t"+str(p)+": "+ str(params[p]))
        
        fit_params = {}
        for f in fit_parameter.keys():
            fit_key = 'predictor__'+str(f)
            fit_params[fit_key] = fit_parameter[f] 
            
            
        grid_search = GridSearchCV(full_pipeline_with_predictor, params, scoring='roc_auc', cv=5, 
                                   n_jobs=n_jobs, verbose=verbose)
        grid_search.fit(X_train, y_train, **fit_params)
                
        # # Best estimator score
        # best_train = pct(grid_search.best_score_)

        # Best estimator fitting time
        
        start = time()
        grid_search.best_estimator_.fit(X_train, y_train)
        Train_time = round(time() - start, 4)

        # Best estimator prediction time
#         y_pred_train = grid_search.best_estimator_.predict(X_train)
#         y_pred = grid_search.best_estimator_.predict(X_test)
        
        accuracy_train = round(accuracy_score(y_train,grid_search.best_estimator_.predict(X_train)), 3)
        roc_train = round(roc_auc_score(y_train,  grid_search.best_estimator_.predict_proba(X_train)[:, 1]), 3)
                
        accuracy_test = round(accuracy_score(y_test,grid_search.best_estimator_.predict(X_test)), 3)
        roc_test = round(roc_auc_score(y_test, grid_search.best_estimator_.predict_proba(X_test)[:, 1]), 3)

        print('*'*40)
        print('\n')

        title = name + " - Normalized Confusion Matrix"
        disp = plot_confusion_matrix(grid_search.best_estimator_,X_test, y_test, normalize='true')
        disp.ax_.set_title(title)
        plt.show()

        print('-'*40)
        print('\n')

        disp2 = plot_roc_curve(grid_search.best_estimator_, X_test, y_test, )
        disp2.ax_.set_title ("ROC curve - " + name)
        plt.show()

        print('*'*40)
        print('\n')

        # Collect the best parameters found by the grid search
        print("Best Parameters:")
        best_parameters = grid_search.best_estimator_.get_params()
        param_dump = []
        for param_name in sorted(params.keys()):
            param_dump.append((param_name, best_parameters[param_name]))
            print("\t"+str(param_name)+": " + str(best_parameters[param_name]))
        print("****** FINISH",prefix,name," *****")
        print("")
        print("-"*40)
        print("*"*40)
        print("-"*40)
        # Record the results
        explog.loc[len(explog)] = [prefix+name, accuracy_train, roc_train,
                                   accuracy_test, roc_test, Train_time, total_exp,
                                   json.dumps(param_dump)]
    sttime = datetime.now().strftime('%Y%m%d_%H:%M:%S - ') 
    display(explog)
    explog.to_csv(sttime + 'experiment_log.csv', index = False)

Hyper Parameter Tuning

Unbalanced Datset Model Tuning and Result

In [5]:
full_train['TARGET'].value_counts(normalize = True).round(2).plot.pie(autopct='%1.1f%%')
Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f030f8a0ba8>
In [ ]:
%%time
# This might take a while
if __name__ == "__main__":

    ConductGridSearch(X_train, y_train, X_test, y_test, 0, "Best Model Unbalanced:",  n_jobs=-1,verbose=1, )
****** START Best Model Unbalanced: XGBoost *****
Parameters:
	colsample_bytree: [0.8, 0.95]
	eta: [0.01, 0.05, 0.1]
	max_depth: [3, 6, 8]
	min_child_weight: [20, 35]
	n_estimators: [500, 1000]
	verbose_eval: [False]
	verbosity: [0]
Fitting 5 folds for each of 72 candidates, totalling 360 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 5 concurrent workers.
[Parallel(n_jobs=-1)]: Done  40 tasks      | elapsed: 97.7min
[Parallel(n_jobs=-1)]: Done 190 tasks      | elapsed: 340.9min
[Parallel(n_jobs=-1)]: Done 360 out of 360 | elapsed: 714.0min finished
[0]	validation_0-auc:0.70102
Will train until validation_0-auc hasn't improved in 50 rounds.
[1]	validation_0-auc:0.70689
[2]	validation_0-auc:0.71315
[3]	validation_0-auc:0.71277
[4]	validation_0-auc:0.71405
[5]	validation_0-auc:0.71547
[6]	validation_0-auc:0.71549
[7]	validation_0-auc:0.71521
[8]	validation_0-auc:0.71679
[9]	validation_0-auc:0.71751
[10]	validation_0-auc:0.71709
[11]	validation_0-auc:0.71755
[12]	validation_0-auc:0.71810
[13]	validation_0-auc:0.71863
[14]	validation_0-auc:0.71903
[15]	validation_0-auc:0.71915
[16]	validation_0-auc:0.71947
[17]	validation_0-auc:0.72039
[18]	validation_0-auc:0.72062
[19]	validation_0-auc:0.72070
[20]	validation_0-auc:0.72077
[21]	validation_0-auc:0.72084
[22]	validation_0-auc:0.72114
[23]	validation_0-auc:0.72092
[24]	validation_0-auc:0.72119
[25]	validation_0-auc:0.72193
[26]	validation_0-auc:0.72197
[27]	validation_0-auc:0.72303
[28]	validation_0-auc:0.72314
[29]	validation_0-auc:0.72338
[30]	validation_0-auc:0.72355
[31]	validation_0-auc:0.72377
[32]	validation_0-auc:0.72388
[33]	validation_0-auc:0.72410
[34]	validation_0-auc:0.72425
[35]	validation_0-auc:0.72451
[36]	validation_0-auc:0.72484
[37]	validation_0-auc:0.72593
[38]	validation_0-auc:0.72597
[39]	validation_0-auc:0.72621
[40]	validation_0-auc:0.72644
[41]	validation_0-auc:0.72675
[42]	validation_0-auc:0.72682
[43]	validation_0-auc:0.72694
[44]	validation_0-auc:0.72818
[45]	validation_0-auc:0.72823
[46]	validation_0-auc:0.72850
[47]	validation_0-auc:0.72857
[48]	validation_0-auc:0.72873
[49]	validation_0-auc:0.72875
[50]	validation_0-auc:0.72911
[51]	validation_0-auc:0.72929
[52]	validation_0-auc:0.72999
[53]	validation_0-auc:0.73059
[54]	validation_0-auc:0.73068
[55]	validation_0-auc:0.73111
[56]	validation_0-auc:0.73120
[57]	validation_0-auc:0.73249
[58]	validation_0-auc:0.73294
[59]	validation_0-auc:0.73345
[60]	validation_0-auc:0.73364
[61]	validation_0-auc:0.73401
[62]	validation_0-auc:0.73436
[63]	validation_0-auc:0.73443
[64]	validation_0-auc:0.73479
[65]	validation_0-auc:0.73504
[66]	validation_0-auc:0.73542
[67]	validation_0-auc:0.73625
[68]	validation_0-auc:0.73656
[69]	validation_0-auc:0.73703
[70]	validation_0-auc:0.73742
[71]	validation_0-auc:0.73836
[72]	validation_0-auc:0.73891
[73]	validation_0-auc:0.73904
[74]	validation_0-auc:0.73963
[75]	validation_0-auc:0.73990
[76]	validation_0-auc:0.74029
[77]	validation_0-auc:0.74083
[78]	validation_0-auc:0.74149
[79]	validation_0-auc:0.74166
[80]	validation_0-auc:0.74208
[81]	validation_0-auc:0.74257
[82]	validation_0-auc:0.74280
[83]	validation_0-auc:0.74294
[84]	validation_0-auc:0.74334
[85]	validation_0-auc:0.74381
[86]	validation_0-auc:0.74459
[87]	validation_0-auc:0.74521
[88]	validation_0-auc:0.74562
[89]	validation_0-auc:0.74615
[90]	validation_0-auc:0.74649
[91]	validation_0-auc:0.74674
[92]	validation_0-auc:0.74699
[93]	validation_0-auc:0.74754
[94]	validation_0-auc:0.74808
[95]	validation_0-auc:0.74812
[96]	validation_0-auc:0.74850
[97]	validation_0-auc:0.74880
[98]	validation_0-auc:0.74892
[99]	validation_0-auc:0.74919
[100]	validation_0-auc:0.74960
[101]	validation_0-auc:0.74996
[102]	validation_0-auc:0.75044
[103]	validation_0-auc:0.75065
[104]	validation_0-auc:0.75089
[105]	validation_0-auc:0.75135
[106]	validation_0-auc:0.75168
[107]	validation_0-auc:0.75192
[108]	validation_0-auc:0.75217
[109]	validation_0-auc:0.75250
[110]	validation_0-auc:0.75282
[111]	validation_0-auc:0.75307
[112]	validation_0-auc:0.75341
[113]	validation_0-auc:0.75374
[114]	validation_0-auc:0.75400
[115]	validation_0-auc:0.75426
[116]	validation_0-auc:0.75436
[117]	validation_0-auc:0.75458
[118]	validation_0-auc:0.75458
[119]	validation_0-auc:0.75481
[120]	validation_0-auc:0.75502
[121]	validation_0-auc:0.75518
[122]	validation_0-auc:0.75542
[123]	validation_0-auc:0.75559
[124]	validation_0-auc:0.75593
[125]	validation_0-auc:0.75610
[126]	validation_0-auc:0.75607
[127]	validation_0-auc:0.75621
[128]	validation_0-auc:0.75653
[129]	validation_0-auc:0.75666
[130]	validation_0-auc:0.75684
[131]	validation_0-auc:0.75701
[132]	validation_0-auc:0.75726
[133]	validation_0-auc:0.75743
[134]	validation_0-auc:0.75748
[135]	validation_0-auc:0.75772
[136]	validation_0-auc:0.75799
[137]	validation_0-auc:0.75820
[138]	validation_0-auc:0.75845
[139]	validation_0-auc:0.75861
[140]	validation_0-auc:0.75894
[141]	validation_0-auc:0.75916
[142]	validation_0-auc:0.75934
[143]	validation_0-auc:0.75933
[144]	validation_0-auc:0.75958
[145]	validation_0-auc:0.75974
[146]	validation_0-auc:0.75982
[147]	validation_0-auc:0.75984
[148]	validation_0-auc:0.76008
[149]	validation_0-auc:0.76018
[150]	validation_0-auc:0.76023
[151]	validation_0-auc:0.76021
[152]	validation_0-auc:0.76039
[153]	validation_0-auc:0.76058
[154]	validation_0-auc:0.76069
[155]	validation_0-auc:0.76085
[156]	validation_0-auc:0.76099
[157]	validation_0-auc:0.76110
[158]	validation_0-auc:0.76109
[159]	validation_0-auc:0.76107
[160]	validation_0-auc:0.76111
[161]	validation_0-auc:0.76113
[162]	validation_0-auc:0.76130
[163]	validation_0-auc:0.76131
[164]	validation_0-auc:0.76137
[165]	validation_0-auc:0.76157
[166]	validation_0-auc:0.76173
[167]	validation_0-auc:0.76194
[168]	validation_0-auc:0.76198
[169]	validation_0-auc:0.76212
[170]	validation_0-auc:0.76217
[171]	validation_0-auc:0.76225
[172]	validation_0-auc:0.76236
[173]	validation_0-auc:0.76240
[174]	validation_0-auc:0.76241
[175]	validation_0-auc:0.76245
[176]	validation_0-auc:0.76252
[177]	validation_0-auc:0.76268
[178]	validation_0-auc:0.76267
[179]	validation_0-auc:0.76280
[180]	validation_0-auc:0.76295
[181]	validation_0-auc:0.76301
[182]	validation_0-auc:0.76311
[183]	validation_0-auc:0.76336
[184]	validation_0-auc:0.76346
[185]	validation_0-auc:0.76351
[186]	validation_0-auc:0.76349
[187]	validation_0-auc:0.76355
[188]	validation_0-auc:0.76366
[189]	validation_0-auc:0.76379
[190]	validation_0-auc:0.76394
[191]	validation_0-auc:0.76406
[192]	validation_0-auc:0.76410
[193]	validation_0-auc:0.76413
[194]	validation_0-auc:0.76416
[195]	validation_0-auc:0.76424
[196]	validation_0-auc:0.76428
[197]	validation_0-auc:0.76431
[198]	validation_0-auc:0.76441
[199]	validation_0-auc:0.76443
[200]	validation_0-auc:0.76445
[201]	validation_0-auc:0.76449
[202]	validation_0-auc:0.76451
[203]	validation_0-auc:0.76464
[204]	validation_0-auc:0.76475
[205]	validation_0-auc:0.76480
[206]	validation_0-auc:0.76488
[207]	validation_0-auc:0.76499
[208]	validation_0-auc:0.76499
[209]	validation_0-auc:0.76496
[210]	validation_0-auc:0.76502
[211]	validation_0-auc:0.76508
[212]	validation_0-auc:0.76511
[213]	validation_0-auc:0.76517
[214]	validation_0-auc:0.76524
[215]	validation_0-auc:0.76532
[216]	validation_0-auc:0.76539
[217]	validation_0-auc:0.76546
[218]	validation_0-auc:0.76562
[219]	validation_0-auc:0.76560
[220]	validation_0-auc:0.76566
[221]	validation_0-auc:0.76575
[222]	validation_0-auc:0.76588
[223]	validation_0-auc:0.76588
[224]	validation_0-auc:0.76591
[225]	validation_0-auc:0.76592
[226]	validation_0-auc:0.76592
[227]	validation_0-auc:0.76593
[228]	validation_0-auc:0.76590
[229]	validation_0-auc:0.76593
[230]	validation_0-auc:0.76600
[231]	validation_0-auc:0.76610
[232]	validation_0-auc:0.76614
[233]	validation_0-auc:0.76614
[234]	validation_0-auc:0.76616
[235]	validation_0-auc:0.76625
[236]	validation_0-auc:0.76631
[237]	validation_0-auc:0.76641
[238]	validation_0-auc:0.76641
[239]	validation_0-auc:0.76649
[240]	validation_0-auc:0.76653
[241]	validation_0-auc:0.76659
[242]	validation_0-auc:0.76653
[243]	validation_0-auc:0.76660
[244]	validation_0-auc:0.76656
[245]	validation_0-auc:0.76657
[246]	validation_0-auc:0.76652
[247]	validation_0-auc:0.76680
[248]	validation_0-auc:0.76689
[249]	validation_0-auc:0.76696
[250]	validation_0-auc:0.76698
[251]	validation_0-auc:0.76700
[252]	validation_0-auc:0.76703
[253]	validation_0-auc:0.76701
[254]	validation_0-auc:0.76703
[255]	validation_0-auc:0.76722
[256]	validation_0-auc:0.76721
[257]	validation_0-auc:0.76717
[258]	validation_0-auc:0.76717
[259]	validation_0-auc:0.76723
[260]	validation_0-auc:0.76725
[261]	validation_0-auc:0.76733
[262]	validation_0-auc:0.76731
[263]	validation_0-auc:0.76733
[264]	validation_0-auc:0.76739
[265]	validation_0-auc:0.76745
[266]	validation_0-auc:0.76753
[267]	validation_0-auc:0.76748
[268]	validation_0-auc:0.76748
[269]	validation_0-auc:0.76745
[270]	validation_0-auc:0.76747
[271]	validation_0-auc:0.76746
[272]	validation_0-auc:0.76750
[273]	validation_0-auc:0.76750
[274]	validation_0-auc:0.76752
[275]	validation_0-auc:0.76754
[276]	validation_0-auc:0.76758
[277]	validation_0-auc:0.76767
[278]	validation_0-auc:0.76767
[279]	validation_0-auc:0.76771
[280]	validation_0-auc:0.76774
[281]	validation_0-auc:0.76775
[282]	validation_0-auc:0.76775
[283]	validation_0-auc:0.76780
[284]	validation_0-auc:0.76779
[285]	validation_0-auc:0.76800
[286]	validation_0-auc:0.76805
[287]	validation_0-auc:0.76800
[288]	validation_0-auc:0.76810
[289]	validation_0-auc:0.76812
[290]	validation_0-auc:0.76815
[291]	validation_0-auc:0.76817
[292]	validation_0-auc:0.76821
[293]	validation_0-auc:0.76821
[294]	validation_0-auc:0.76824
[295]	validation_0-auc:0.76826
[296]	validation_0-auc:0.76843
[297]	validation_0-auc:0.76852
[298]	validation_0-auc:0.76860
[299]	validation_0-auc:0.76856
[300]	validation_0-auc:0.76855
[301]	validation_0-auc:0.76862
[302]	validation_0-auc:0.76859
[303]	validation_0-auc:0.76858
[304]	validation_0-auc:0.76861
[305]	validation_0-auc:0.76864
[306]	validation_0-auc:0.76859
[307]	validation_0-auc:0.76860
[308]	validation_0-auc:0.76861
[309]	validation_0-auc:0.76862
[310]	validation_0-auc:0.76858
[311]	validation_0-auc:0.76860
[312]	validation_0-auc:0.76865
[313]	validation_0-auc:0.76869
[314]	validation_0-auc:0.76870
[315]	validation_0-auc:0.76870
[316]	validation_0-auc:0.76873
[317]	validation_0-auc:0.76875
[318]	validation_0-auc:0.76877
[319]	validation_0-auc:0.76875
[320]	validation_0-auc:0.76884
[321]	validation_0-auc:0.76883
[322]	validation_0-auc:0.76881
[323]	validation_0-auc:0.76885
[324]	validation_0-auc:0.76888
[325]	validation_0-auc:0.76893
[326]	validation_0-auc:0.76891
[327]	validation_0-auc:0.76899
[328]	validation_0-auc:0.76902
[329]	validation_0-auc:0.76903
[330]	validation_0-auc:0.76904
[331]	validation_0-auc:0.76911
[332]	validation_0-auc:0.76910
[333]	validation_0-auc:0.76914
[334]	validation_0-auc:0.76914
[335]	validation_0-auc:0.76918
[336]	validation_0-auc:0.76917
[337]	validation_0-auc:0.76923
[338]	validation_0-auc:0.76931
[339]	validation_0-auc:0.76932
[340]	validation_0-auc:0.76929
[341]	validation_0-auc:0.76930
[342]	validation_0-auc:0.76935
[343]	validation_0-auc:0.76936
[344]	validation_0-auc:0.76939
[345]	validation_0-auc:0.76939
[346]	validation_0-auc:0.76937
[347]	validation_0-auc:0.76939
[348]	validation_0-auc:0.76940
[349]	validation_0-auc:0.76935
[350]	validation_0-auc:0.76935
[351]	validation_0-auc:0.76933
[352]	validation_0-auc:0.76933
[353]	validation_0-auc:0.76938
[354]	validation_0-auc:0.76940
[355]	validation_0-auc:0.76953
[356]	validation_0-auc:0.76953
[357]	validation_0-auc:0.76951
[358]	validation_0-auc:0.76955
[359]	validation_0-auc:0.76954
[360]	validation_0-auc:0.76959
[361]	validation_0-auc:0.76956
[362]	validation_0-auc:0.76958
[363]	validation_0-auc:0.76958
[364]	validation_0-auc:0.76962
[365]	validation_0-auc:0.76967
[366]	validation_0-auc:0.76965
[367]	validation_0-auc:0.76963
[368]	validation_0-auc:0.76964
[369]	validation_0-auc:0.76965
[370]	validation_0-auc:0.76967
[371]	validation_0-auc:0.76972
[372]	validation_0-auc:0.76970
[373]	validation_0-auc:0.76972
[374]	validation_0-auc:0.76974
[375]	validation_0-auc:0.76972
[376]	validation_0-auc:0.76976
[377]	validation_0-auc:0.76980
[378]	validation_0-auc:0.76978
[379]	validation_0-auc:0.76980
[380]	validation_0-auc:0.76983
[381]	validation_0-auc:0.76983
[382]	validation_0-auc:0.76986
[383]	validation_0-auc:0.76988
[384]	validation_0-auc:0.76987
[385]	validation_0-auc:0.76983
[386]	validation_0-auc:0.76981
[387]	validation_0-auc:0.76998
[388]	validation_0-auc:0.76997
[389]	validation_0-auc:0.77002
[390]	validation_0-auc:0.77006
[391]	validation_0-auc:0.77002
[392]	validation_0-auc:0.77006
[393]	validation_0-auc:0.77007
[394]	validation_0-auc:0.77005
[395]	validation_0-auc:0.77010
[396]	validation_0-auc:0.77015
[397]	validation_0-auc:0.77015
[398]	validation_0-auc:0.77015
[399]	validation_0-auc:0.77015
[400]	validation_0-auc:0.77017
[401]	validation_0-auc:0.77020
[402]	validation_0-auc:0.77020
[403]	validation_0-auc:0.77028
[404]	validation_0-auc:0.77028
[405]	validation_0-auc:0.77028
[406]	validation_0-auc:0.77030
[407]	validation_0-auc:0.77028
[408]	validation_0-auc:0.77032
[409]	validation_0-auc:0.77033
[410]	validation_0-auc:0.77036
[411]	validation_0-auc:0.77039
[412]	validation_0-auc:0.77039
[413]	validation_0-auc:0.77033
[414]	validation_0-auc:0.77035
[415]	validation_0-auc:0.77032
[416]	validation_0-auc:0.77032
[417]	validation_0-auc:0.77036
[418]	validation_0-auc:0.77041
[419]	validation_0-auc:0.77039
[420]	validation_0-auc:0.77038
[421]	validation_0-auc:0.77034
[422]	validation_0-auc:0.77040
[423]	validation_0-auc:0.77039
[424]	validation_0-auc:0.77039
[425]	validation_0-auc:0.77036
[426]	validation_0-auc:0.77036
[427]	validation_0-auc:0.77035
[428]	validation_0-auc:0.77034
[429]	validation_0-auc:0.77043
[430]	validation_0-auc:0.77037
[431]	validation_0-auc:0.77043
[432]	validation_0-auc:0.77044
[433]	validation_0-auc:0.77051
[434]	validation_0-auc:0.77055
[435]	validation_0-auc:0.77060
[436]	validation_0-auc:0.77061
[437]	validation_0-auc:0.77060
[438]	validation_0-auc:0.77058
[439]	validation_0-auc:0.77058
[440]	validation_0-auc:0.77063
[441]	validation_0-auc:0.77062
[442]	validation_0-auc:0.77064
[443]	validation_0-auc:0.77065
[444]	validation_0-auc:0.77065
[445]	validation_0-auc:0.77061
[446]	validation_0-auc:0.77059
[447]	validation_0-auc:0.77060
[448]	validation_0-auc:0.77055
[449]	validation_0-auc:0.77052
[450]	validation_0-auc:0.77054
[451]	validation_0-auc:0.77055
[452]	validation_0-auc:0.77053
[453]	validation_0-auc:0.77056
[454]	validation_0-auc:0.77064
[455]	validation_0-auc:0.77065
[456]	validation_0-auc:0.77063
[457]	validation_0-auc:0.77062
[458]	validation_0-auc:0.77063
[459]	validation_0-auc:0.77066
[460]	validation_0-auc:0.77072
[461]	validation_0-auc:0.77077
[462]	validation_0-auc:0.77075
[463]	validation_0-auc:0.77077
[464]	validation_0-auc:0.77080
[465]	validation_0-auc:0.77077
[466]	validation_0-auc:0.77074
[467]	validation_0-auc:0.77075
[468]	validation_0-auc:0.77076
[469]	validation_0-auc:0.77078
[470]	validation_0-auc:0.77076
[471]	validation_0-auc:0.77079
[472]	validation_0-auc:0.77077
[473]	validation_0-auc:0.77078
[474]	validation_0-auc:0.77079
[475]	validation_0-auc:0.77079
[476]	validation_0-auc:0.77080
[477]	validation_0-auc:0.77083
[478]	validation_0-auc:0.77086
[479]	validation_0-auc:0.77084
[480]	validation_0-auc:0.77082
[481]	validation_0-auc:0.77081
[482]	validation_0-auc:0.77080
[483]	validation_0-auc:0.77077
[484]	validation_0-auc:0.77081
[485]	validation_0-auc:0.77080
[486]	validation_0-auc:0.77079
[487]	validation_0-auc:0.77081
[488]	validation_0-auc:0.77084
[489]	validation_0-auc:0.77087
[490]	validation_0-auc:0.77089
[491]	validation_0-auc:0.77089
[492]	validation_0-auc:0.77092
[493]	validation_0-auc:0.77089
[494]	validation_0-auc:0.77092
[495]	validation_0-auc:0.77092
[496]	validation_0-auc:0.77092
[497]	validation_0-auc:0.77095
[498]	validation_0-auc:0.77094
[499]	validation_0-auc:0.77107
[500]	validation_0-auc:0.77115
[501]	validation_0-auc:0.77118
[502]	validation_0-auc:0.77121
[503]	validation_0-auc:0.77120
[504]	validation_0-auc:0.77121
[505]	validation_0-auc:0.77118
[506]	validation_0-auc:0.77119
[507]	validation_0-auc:0.77119
[508]	validation_0-auc:0.77119
[509]	validation_0-auc:0.77118
[510]	validation_0-auc:0.77122
[511]	validation_0-auc:0.77124
[512]	validation_0-auc:0.77125
[513]	validation_0-auc:0.77126
[514]	validation_0-auc:0.77123
[515]	validation_0-auc:0.77119
[516]	validation_0-auc:0.77119
[517]	validation_0-auc:0.77121
[518]	validation_0-auc:0.77124
[519]	validation_0-auc:0.77128
[520]	validation_0-auc:0.77131
[521]	validation_0-auc:0.77138
[522]	validation_0-auc:0.77133
[523]	validation_0-auc:0.77136
[524]	validation_0-auc:0.77137
[525]	validation_0-auc:0.77139
[526]	validation_0-auc:0.77136
[527]	validation_0-auc:0.77138
[528]	validation_0-auc:0.77137
[529]	validation_0-auc:0.77137
[530]	validation_0-auc:0.77139
[531]	validation_0-auc:0.77135
[532]	validation_0-auc:0.77136
[533]	validation_0-auc:0.77137
[534]	validation_0-auc:0.77150
[535]	validation_0-auc:0.77152
[536]	validation_0-auc:0.77153
[537]	validation_0-auc:0.77151
[538]	validation_0-auc:0.77152
[539]	validation_0-auc:0.77155
[540]	validation_0-auc:0.77151
[541]	validation_0-auc:0.77152
[542]	validation_0-auc:0.77152
[543]	validation_0-auc:0.77147
[544]	validation_0-auc:0.77147
[545]	validation_0-auc:0.77144
[546]	validation_0-auc:0.77145
[547]	validation_0-auc:0.77150
[548]	validation_0-auc:0.77149
[549]	validation_0-auc:0.77147
[550]	validation_0-auc:0.77151
[551]	validation_0-auc:0.77154
[552]	validation_0-auc:0.77156
[553]	validation_0-auc:0.77155
[554]	validation_0-auc:0.77154
[555]	validation_0-auc:0.77151
[556]	validation_0-auc:0.77155
[557]	validation_0-auc:0.77159
[558]	validation_0-auc:0.77161
[559]	validation_0-auc:0.77166
[560]	validation_0-auc:0.77164
[561]	validation_0-auc:0.77164
[562]	validation_0-auc:0.77171
[563]	validation_0-auc:0.77170
[564]	validation_0-auc:0.77170
[565]	validation_0-auc:0.77168
[566]	validation_0-auc:0.77165
[567]	validation_0-auc:0.77165
[568]	validation_0-auc:0.77166
[569]	validation_0-auc:0.77167
[570]	validation_0-auc:0.77167
[571]	validation_0-auc:0.77164
[572]	validation_0-auc:0.77165
[573]	validation_0-auc:0.77169
[574]	validation_0-auc:0.77171
[575]	validation_0-auc:0.77185
[576]	validation_0-auc:0.77185
[577]	validation_0-auc:0.77186
[578]	validation_0-auc:0.77185
[579]	validation_0-auc:0.77188
[580]	validation_0-auc:0.77190
[581]	validation_0-auc:0.77185
[582]	validation_0-auc:0.77189
[583]	validation_0-auc:0.77185
[584]	validation_0-auc:0.77187
[585]	validation_0-auc:0.77190
[586]	validation_0-auc:0.77191
[587]	validation_0-auc:0.77190
[588]	validation_0-auc:0.77189
[589]	validation_0-auc:0.77188
[590]	validation_0-auc:0.77191
[591]	validation_0-auc:0.77189
[592]	validation_0-auc:0.77191
[593]	validation_0-auc:0.77190
[594]	validation_0-auc:0.77194
[595]	validation_0-auc:0.77191
[596]	validation_0-auc:0.77191
[597]	validation_0-auc:0.77189
[598]	validation_0-auc:0.77188
[599]	validation_0-auc:0.77188
[600]	validation_0-auc:0.77190
[601]	validation_0-auc:0.77191
[602]	validation_0-auc:0.77187
[603]	validation_0-auc:0.77184
[604]	validation_0-auc:0.77183
[605]	validation_0-auc:0.77180
[606]	validation_0-auc:0.77183
[607]	validation_0-auc:0.77180
[608]	validation_0-auc:0.77177
[609]	validation_0-auc:0.77181
[610]	validation_0-auc:0.77182
[611]	validation_0-auc:0.77184
[612]	validation_0-auc:0.77186
[613]	validation_0-auc:0.77184
[614]	validation_0-auc:0.77185
[615]	validation_0-auc:0.77191
[616]	validation_0-auc:0.77191
[617]	validation_0-auc:0.77193
[618]	validation_0-auc:0.77190
[619]	validation_0-auc:0.77189
[620]	validation_0-auc:0.77192
[621]	validation_0-auc:0.77193
[622]	validation_0-auc:0.77193
[623]	validation_0-auc:0.77191
[624]	validation_0-auc:0.77195
[625]	validation_0-auc:0.77197
[626]	validation_0-auc:0.77208
[627]	validation_0-auc:0.77208
[628]	validation_0-auc:0.77209
[629]	validation_0-auc:0.77209
[630]	validation_0-auc:0.77213
[631]	validation_0-auc:0.77212
[632]	validation_0-auc:0.77209
[633]	validation_0-auc:0.77206
[634]	validation_0-auc:0.77204
[635]	validation_0-auc:0.77205
[636]	validation_0-auc:0.77205
[637]	validation_0-auc:0.77205
[638]	validation_0-auc:0.77204
[639]	validation_0-auc:0.77212
[640]	validation_0-auc:0.77216
[641]	validation_0-auc:0.77217
[642]	validation_0-auc:0.77218
[643]	validation_0-auc:0.77216
[644]	validation_0-auc:0.77218
[645]	validation_0-auc:0.77221
[646]	validation_0-auc:0.77217
[647]	validation_0-auc:0.77217
[648]	validation_0-auc:0.77211
[649]	validation_0-auc:0.77211
[650]	validation_0-auc:0.77214
[651]	validation_0-auc:0.77216
[652]	validation_0-auc:0.77214
[653]	validation_0-auc:0.77215
[654]	validation_0-auc:0.77221
[655]	validation_0-auc:0.77227
[656]	validation_0-auc:0.77224
[657]	validation_0-auc:0.77223
[658]	validation_0-auc:0.77230
[659]	validation_0-auc:0.77228
[660]	validation_0-auc:0.77226
[661]	validation_0-auc:0.77227
[662]	validation_0-auc:0.77231
[663]	validation_0-auc:0.77231
[664]	validation_0-auc:0.77230
[665]	validation_0-auc:0.77230
[666]	validation_0-auc:0.77229
[667]	validation_0-auc:0.77226
[668]	validation_0-auc:0.77236
[669]	validation_0-auc:0.77235
[670]	validation_0-auc:0.77236
[671]	validation_0-auc:0.77236
[672]	validation_0-auc:0.77238
[673]	validation_0-auc:0.77236
[674]	validation_0-auc:0.77233
[675]	validation_0-auc:0.77228
[676]	validation_0-auc:0.77228
[677]	validation_0-auc:0.77228
[678]	validation_0-auc:0.77226
[679]	validation_0-auc:0.77222
[680]	validation_0-auc:0.77217
[681]	validation_0-auc:0.77213
[682]	validation_0-auc:0.77215
[683]	validation_0-auc:0.77214
[684]	validation_0-auc:0.77216
[685]	validation_0-auc:0.77211
[686]	validation_0-auc:0.77211
[687]	validation_0-auc:0.77210
[688]	validation_0-auc:0.77212
[689]	validation_0-auc:0.77209
[690]	validation_0-auc:0.77206
[691]	validation_0-auc:0.77204
[692]	validation_0-auc:0.77203
[693]	validation_0-auc:0.77204
[694]	validation_0-auc:0.77203
[695]	validation_0-auc:0.77206
[696]	validation_0-auc:0.77212
[697]	validation_0-auc:0.77220
[698]	validation_0-auc:0.77221
[699]	validation_0-auc:0.77218
[700]	validation_0-auc:0.77223
[701]	validation_0-auc:0.77227
[702]	validation_0-auc:0.77230
[703]	validation_0-auc:0.77230
[704]	validation_0-auc:0.77229
[705]	validation_0-auc:0.77231
[706]	validation_0-auc:0.77229
[707]	validation_0-auc:0.77229
[708]	validation_0-auc:0.77228
[709]	validation_0-auc:0.77229
[710]	validation_0-auc:0.77232
[711]	validation_0-auc:0.77230
[712]	validation_0-auc:0.77226
[713]	validation_0-auc:0.77224
[714]	validation_0-auc:0.77226
[715]	validation_0-auc:0.77224
[716]	validation_0-auc:0.77224
[717]	validation_0-auc:0.77229
[718]	validation_0-auc:0.77232
[719]	validation_0-auc:0.77231
[720]	validation_0-auc:0.77228
[721]	validation_0-auc:0.77229
[722]	validation_0-auc:0.77232
Stopping. Best iteration:
[672]	validation_0-auc:0.77238

****************************************


----------------------------------------


****************************************


Best Parameters:
	predictor__colsample_bytree: 0.8
	predictor__eta: 0.05
	predictor__max_depth: 3
	predictor__min_child_weight: 20
	predictor__n_estimators: 1000
	predictor__verbose_eval: False
	predictor__verbosity: 0
****** FINISH Best Model Unbalanced: XGBoost  *****

----------------------------------------
****************************************
----------------------------------------
****** START Best Model Unbalanced: LightGBM *****
Parameters:
	boosting_type: ['gbdt']
	colsample_bytree: [0.8, 0.95]
	is_unbalance: [True]
	learning_rate: [0.01, 0.05, 0.1]
	min_split_gain: [0.01, 0.02, 0.05]
	n_estimators: [5000, 10000]
	num_leaves: [30, 35]
	reg_alpha: [0.01, 0.02, 0.05]
	reg_lambda: [0.01, 0.02, 0.05]
Fitting 5 folds for each of 648 candidates, totalling 3240 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 5 concurrent workers.
[Parallel(n_jobs=-1)]: Done  40 tasks      | elapsed:  2.1min
[Parallel(n_jobs=-1)]: Done 190 tasks      | elapsed:  9.8min
[Parallel(n_jobs=-1)]: Done 440 tasks      | elapsed: 21.5min
[Parallel(n_jobs=-1)]: Done 790 tasks      | elapsed: 34.2min
[Parallel(n_jobs=-1)]: Done 1240 tasks      | elapsed: 50.0min
[Parallel(n_jobs=-1)]: Done 1790 tasks      | elapsed: 70.8min
[Parallel(n_jobs=-1)]: Done 2440 tasks      | elapsed: 94.3min
[Parallel(n_jobs=-1)]: Done 3190 tasks      | elapsed: 121.3min
[Parallel(n_jobs=-1)]: Done 3240 out of 3240 | elapsed: 123.0min finished
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[6]	valid's auc: 0.730232	valid's binary_logloss: 0.276195
****************************************


----------------------------------------


****************************************


Best Parameters:
	predictor__boosting_type: gbdt
	predictor__colsample_bytree: 0.8
	predictor__is_unbalance: True
	predictor__learning_rate: 0.01
	predictor__min_split_gain: 0.01
	predictor__n_estimators: 5000
	predictor__num_leaves: 35
	predictor__reg_alpha: 0.05
	predictor__reg_lambda: 0.02
****** FINISH Best Model Unbalanced: LightGBM  *****

----------------------------------------
****************************************
----------------------------------------
Model Train Accuracy Score Train AUC Score Test Accuracy Score Test AUC Score Training Time Experiments Description
0 Best Model Unbalanced:XGBoost 0.922 0.823 0.920 0.779 233.2998 360 [["predictor__colsample_bytree", 0.8], ["predi...
1 Best Model Unbalanced:LightGBM 0.922 0.990 0.842 0.760 125.9134 3240 [["predictor__boosting_type", "gbdt"], ["predi...
CPU times: user 45min 47s, sys: 25.1 s, total: 46min 12s
Wall time: 14h 6min 15s

Manually Balanced Data Set (Ratio = 0.25)

In [10]:
def manual_balancing(df, ratio = 0.25):
    assert (ratio <= 0.5), "ratio must be less than or equal to 0.5"
    target1 = df[df['TARGET'] == 1]
    target0 = df[df['TARGET'] == 0]
    ratio_eq = ratio / (1-ratio)
    n = int(min(target1.shape[0]/ratio_eq, target0.shape[0]))
    
    majority_df = target0.sample(n, random_state = 42)
    df_balanced = pd.concat([target1, majority_df]).sample(frac=1).reset_index(drop = True)
    return df_balanced 
In [56]:
df_balanced = manual_balancing(df_train_reduced, 0.25)
X_bal = df_balanced.drop('TARGET', axis = 1).values
y_bal = df_balanced['TARGET'].values
df_balanced['TARGET'].value_counts(normalize = True).round(2).plot.pie(autopct='%1.1f%%')
Out[56]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f482a37eb70>
In [57]:
X_train, X_test, y_train, y_test = train_test_split(X_bal, y_bal , stratify = y_bal, 
                                                    test_size = 0.3, random_state = 42)
In [58]:
%%time
# This might take a while
if __name__ == "__main__":

    ConductGridSearch(X_train, y_train, X_test, y_test, 0, "Best Model Manual Balanced 50%:",  n_jobs=-1,verbose=1)
****** START Best Model Manual Balanced: XGBoost *****
Parameters:
	colsample_bytree: [0.8, 0.95]
	eta: [0.01, 0.05, 0.1]
	max_depth: [3, 6, 8]
	min_child_weight: [20, 35]
	n_estimators: [500, 1000]
	verbose_eval: [False]
	verbosity: [0]
Fitting 5 folds for each of 72 candidates, totalling 360 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed: 18.8min
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed: 62.3min
[Parallel(n_jobs=-1)]: Done 360 out of 360 | elapsed: 124.8min finished
[0]	validation_0-auc:0.71776
Will train until validation_0-auc hasn't improved in 50 rounds.
[1]	validation_0-auc:0.72874
[2]	validation_0-auc:0.73538
[3]	validation_0-auc:0.73559
[4]	validation_0-auc:0.73552
[5]	validation_0-auc:0.73644
[6]	validation_0-auc:0.73782
[7]	validation_0-auc:0.73755
[8]	validation_0-auc:0.73750
[9]	validation_0-auc:0.73876
[10]	validation_0-auc:0.73857
[11]	validation_0-auc:0.74057
[12]	validation_0-auc:0.74110
[13]	validation_0-auc:0.74157
[14]	validation_0-auc:0.74159
[15]	validation_0-auc:0.74238
[16]	validation_0-auc:0.74296
[17]	validation_0-auc:0.74292
[18]	validation_0-auc:0.74401
[19]	validation_0-auc:0.74441
[20]	validation_0-auc:0.74466
[21]	validation_0-auc:0.74465
[22]	validation_0-auc:0.74503
[23]	validation_0-auc:0.74493
[24]	validation_0-auc:0.74475
[25]	validation_0-auc:0.74455
[26]	validation_0-auc:0.74492
[27]	validation_0-auc:0.74525
[28]	validation_0-auc:0.74535
[29]	validation_0-auc:0.74572
[30]	validation_0-auc:0.74567
[31]	validation_0-auc:0.74555
[32]	validation_0-auc:0.74599
[33]	validation_0-auc:0.74617
[34]	validation_0-auc:0.74648
[35]	validation_0-auc:0.74636
[36]	validation_0-auc:0.74641
[37]	validation_0-auc:0.74675
[38]	validation_0-auc:0.74692
[39]	validation_0-auc:0.74682
[40]	validation_0-auc:0.74660
[41]	validation_0-auc:0.74657
[42]	validation_0-auc:0.74646
[43]	validation_0-auc:0.74665
[44]	validation_0-auc:0.74663
[45]	validation_0-auc:0.74684
[46]	validation_0-auc:0.74714
[47]	validation_0-auc:0.74732
[48]	validation_0-auc:0.74748
[49]	validation_0-auc:0.74744
[50]	validation_0-auc:0.74739
[51]	validation_0-auc:0.74755
[52]	validation_0-auc:0.74760
[53]	validation_0-auc:0.74755
[54]	validation_0-auc:0.74749
[55]	validation_0-auc:0.74759
[56]	validation_0-auc:0.74748
[57]	validation_0-auc:0.74793
[58]	validation_0-auc:0.74807
[59]	validation_0-auc:0.74830
[60]	validation_0-auc:0.74838
[61]	validation_0-auc:0.74846
[62]	validation_0-auc:0.74859
[63]	validation_0-auc:0.74872
[64]	validation_0-auc:0.74884
[65]	validation_0-auc:0.74897
[66]	validation_0-auc:0.74899
[67]	validation_0-auc:0.74906
[68]	validation_0-auc:0.74915
[69]	validation_0-auc:0.74929
[70]	validation_0-auc:0.74935
[71]	validation_0-auc:0.74952
[72]	validation_0-auc:0.74955
[73]	validation_0-auc:0.74971
[74]	validation_0-auc:0.74972
[75]	validation_0-auc:0.74980
[76]	validation_0-auc:0.74984
[77]	validation_0-auc:0.74995
[78]	validation_0-auc:0.75012
[79]	validation_0-auc:0.75008
[80]	validation_0-auc:0.75011
[81]	validation_0-auc:0.75014
[82]	validation_0-auc:0.75009
[83]	validation_0-auc:0.75012
[84]	validation_0-auc:0.75019
[85]	validation_0-auc:0.75025
[86]	validation_0-auc:0.75026
[87]	validation_0-auc:0.75038
[88]	validation_0-auc:0.75047
[89]	validation_0-auc:0.75053
[90]	validation_0-auc:0.75065
[91]	validation_0-auc:0.75074
[92]	validation_0-auc:0.75077
[93]	validation_0-auc:0.75080
[94]	validation_0-auc:0.75077
[95]	validation_0-auc:0.75078
[96]	validation_0-auc:0.75097
[97]	validation_0-auc:0.75084
[98]	validation_0-auc:0.75084
[99]	validation_0-auc:0.75101
[100]	validation_0-auc:0.75122
[101]	validation_0-auc:0.75132
[102]	validation_0-auc:0.75161
[103]	validation_0-auc:0.75170
[104]	validation_0-auc:0.75175
[105]	validation_0-auc:0.75194
[106]	validation_0-auc:0.75198
[107]	validation_0-auc:0.75214
[108]	validation_0-auc:0.75228
[109]	validation_0-auc:0.75235
[110]	validation_0-auc:0.75260
[111]	validation_0-auc:0.75276
[112]	validation_0-auc:0.75289
[113]	validation_0-auc:0.75296
[114]	validation_0-auc:0.75308
[115]	validation_0-auc:0.75313
[116]	validation_0-auc:0.75319
[117]	validation_0-auc:0.75338
[118]	validation_0-auc:0.75358
[119]	validation_0-auc:0.75368
[120]	validation_0-auc:0.75373
[121]	validation_0-auc:0.75378
[122]	validation_0-auc:0.75399
[123]	validation_0-auc:0.75409
[124]	validation_0-auc:0.75417
[125]	validation_0-auc:0.75421
[126]	validation_0-auc:0.75430
[127]	validation_0-auc:0.75444
[128]	validation_0-auc:0.75457
[129]	validation_0-auc:0.75467
[130]	validation_0-auc:0.75478
[131]	validation_0-auc:0.75477
[132]	validation_0-auc:0.75493
[133]	validation_0-auc:0.75509
[134]	validation_0-auc:0.75521
[135]	validation_0-auc:0.75528
[136]	validation_0-auc:0.75539
[137]	validation_0-auc:0.75560
[138]	validation_0-auc:0.75574
[139]	validation_0-auc:0.75578
[140]	validation_0-auc:0.75594
[141]	validation_0-auc:0.75610
[142]	validation_0-auc:0.75623
[143]	validation_0-auc:0.75628
[144]	validation_0-auc:0.75633
[145]	validation_0-auc:0.75645
[146]	validation_0-auc:0.75659
[147]	validation_0-auc:0.75664
[148]	validation_0-auc:0.75672
[149]	validation_0-auc:0.75680
[150]	validation_0-auc:0.75696
[151]	validation_0-auc:0.75702
[152]	validation_0-auc:0.75706
[153]	validation_0-auc:0.75718
[154]	validation_0-auc:0.75729
[155]	validation_0-auc:0.75741
[156]	validation_0-auc:0.75754
[157]	validation_0-auc:0.75766
[158]	validation_0-auc:0.75775
[159]	validation_0-auc:0.75778
[160]	validation_0-auc:0.75779
[161]	validation_0-auc:0.75785
[162]	validation_0-auc:0.75794
[163]	validation_0-auc:0.75807
[164]	validation_0-auc:0.75811
[165]	validation_0-auc:0.75816
[166]	validation_0-auc:0.75821
[167]	validation_0-auc:0.75829
[168]	validation_0-auc:0.75838
[169]	validation_0-auc:0.75851
[170]	validation_0-auc:0.75857
[171]	validation_0-auc:0.75875
[172]	validation_0-auc:0.75892
[173]	validation_0-auc:0.75898
[174]	validation_0-auc:0.75910
[175]	validation_0-auc:0.75912
[176]	validation_0-auc:0.75921
[177]	validation_0-auc:0.75933
[178]	validation_0-auc:0.75930
[179]	validation_0-auc:0.75944
[180]	validation_0-auc:0.75954
[181]	validation_0-auc:0.75958
[182]	validation_0-auc:0.75969
[183]	validation_0-auc:0.75970
[184]	validation_0-auc:0.75980
[185]	validation_0-auc:0.75986
[186]	validation_0-auc:0.75994
[187]	validation_0-auc:0.75997
[188]	validation_0-auc:0.75998
[189]	validation_0-auc:0.76007
[190]	validation_0-auc:0.76015
[191]	validation_0-auc:0.76024
[192]	validation_0-auc:0.76032
[193]	validation_0-auc:0.76041
[194]	validation_0-auc:0.76045
[195]	validation_0-auc:0.76046
[196]	validation_0-auc:0.76056
[197]	validation_0-auc:0.76058
[198]	validation_0-auc:0.76064
[199]	validation_0-auc:0.76066
[200]	validation_0-auc:0.76067
[201]	validation_0-auc:0.76074
[202]	validation_0-auc:0.76079
[203]	validation_0-auc:0.76086
[204]	validation_0-auc:0.76085
[205]	validation_0-auc:0.76090
[206]	validation_0-auc:0.76097
[207]	validation_0-auc:0.76100
[208]	validation_0-auc:0.76105
[209]	validation_0-auc:0.76116
[210]	validation_0-auc:0.76125
[211]	validation_0-auc:0.76129
[212]	validation_0-auc:0.76131
[213]	validation_0-auc:0.76138
[214]	validation_0-auc:0.76141
[215]	validation_0-auc:0.76144
[216]	validation_0-auc:0.76157
[217]	validation_0-auc:0.76168
[218]	validation_0-auc:0.76170
[219]	validation_0-auc:0.76172
[220]	validation_0-auc:0.76184
[221]	validation_0-auc:0.76188
[222]	validation_0-auc:0.76194
[223]	validation_0-auc:0.76191
[224]	validation_0-auc:0.76198
[225]	validation_0-auc:0.76202
[226]	validation_0-auc:0.76202
[227]	validation_0-auc:0.76206
[228]	validation_0-auc:0.76207
[229]	validation_0-auc:0.76207
[230]	validation_0-auc:0.76209
[231]	validation_0-auc:0.76216
[232]	validation_0-auc:0.76216
[233]	validation_0-auc:0.76219
[234]	validation_0-auc:0.76220
[235]	validation_0-auc:0.76224
[236]	validation_0-auc:0.76226
[237]	validation_0-auc:0.76229
[238]	validation_0-auc:0.76238
[239]	validation_0-auc:0.76239
[240]	validation_0-auc:0.76250
[241]	validation_0-auc:0.76256
[242]	validation_0-auc:0.76261
[243]	validation_0-auc:0.76267
[244]	validation_0-auc:0.76272
[245]	validation_0-auc:0.76283
[246]	validation_0-auc:0.76283
[247]	validation_0-auc:0.76282
[248]	validation_0-auc:0.76282
[249]	validation_0-auc:0.76292
[250]	validation_0-auc:0.76292
[251]	validation_0-auc:0.76299
[252]	validation_0-auc:0.76303
[253]	validation_0-auc:0.76311
[254]	validation_0-auc:0.76320
[255]	validation_0-auc:0.76323
[256]	validation_0-auc:0.76331
[257]	validation_0-auc:0.76330
[258]	validation_0-auc:0.76335
[259]	validation_0-auc:0.76341
[260]	validation_0-auc:0.76345
[261]	validation_0-auc:0.76351
[262]	validation_0-auc:0.76358
[263]	validation_0-auc:0.76364
[264]	validation_0-auc:0.76370
[265]	validation_0-auc:0.76372
[266]	validation_0-auc:0.76377
[267]	validation_0-auc:0.76383
[268]	validation_0-auc:0.76387
[269]	validation_0-auc:0.76394
[270]	validation_0-auc:0.76397
[271]	validation_0-auc:0.76413
[272]	validation_0-auc:0.76418
[273]	validation_0-auc:0.76422
[274]	validation_0-auc:0.76426
[275]	validation_0-auc:0.76428
[276]	validation_0-auc:0.76435
[277]	validation_0-auc:0.76444
[278]	validation_0-auc:0.76448
[279]	validation_0-auc:0.76450
[280]	validation_0-auc:0.76456
[281]	validation_0-auc:0.76458
[282]	validation_0-auc:0.76458
[283]	validation_0-auc:0.76464
[284]	validation_0-auc:0.76465
[285]	validation_0-auc:0.76468
[286]	validation_0-auc:0.76469
[287]	validation_0-auc:0.76472
[288]	validation_0-auc:0.76476
[289]	validation_0-auc:0.76479
[290]	validation_0-auc:0.76483
[291]	validation_0-auc:0.76485
[292]	validation_0-auc:0.76489
[293]	validation_0-auc:0.76494
[294]	validation_0-auc:0.76497
[295]	validation_0-auc:0.76509
[296]	validation_0-auc:0.76517
[297]	validation_0-auc:0.76519
[298]	validation_0-auc:0.76523
[299]	validation_0-auc:0.76525
[300]	validation_0-auc:0.76526
[301]	validation_0-auc:0.76530
[302]	validation_0-auc:0.76539
[303]	validation_0-auc:0.76545
[304]	validation_0-auc:0.76550
[305]	validation_0-auc:0.76553
[306]	validation_0-auc:0.76563
[307]	validation_0-auc:0.76570
[308]	validation_0-auc:0.76578
[309]	validation_0-auc:0.76589
[310]	validation_0-auc:0.76595
[311]	validation_0-auc:0.76599
[312]	validation_0-auc:0.76607
[313]	validation_0-auc:0.76603
[314]	validation_0-auc:0.76610
[315]	validation_0-auc:0.76621
[316]	validation_0-auc:0.76625
[317]	validation_0-auc:0.76626
[318]	validation_0-auc:0.76633
[319]	validation_0-auc:0.76638
[320]	validation_0-auc:0.76641
[321]	validation_0-auc:0.76650
[322]	validation_0-auc:0.76657
[323]	validation_0-auc:0.76662
[324]	validation_0-auc:0.76669
[325]	validation_0-auc:0.76677
[326]	validation_0-auc:0.76686
[327]	validation_0-auc:0.76688
[328]	validation_0-auc:0.76691
[329]	validation_0-auc:0.76693
[330]	validation_0-auc:0.76702
[331]	validation_0-auc:0.76704
[332]	validation_0-auc:0.76707
[333]	validation_0-auc:0.76713
[334]	validation_0-auc:0.76719
[335]	validation_0-auc:0.76726
[336]	validation_0-auc:0.76730
[337]	validation_0-auc:0.76737
[338]	validation_0-auc:0.76740
[339]	validation_0-auc:0.76743
[340]	validation_0-auc:0.76750
[341]	validation_0-auc:0.76755
[342]	validation_0-auc:0.76760
[343]	validation_0-auc:0.76770
[344]	validation_0-auc:0.76774
[345]	validation_0-auc:0.76784
[346]	validation_0-auc:0.76789
[347]	validation_0-auc:0.76797
[348]	validation_0-auc:0.76809
[349]	validation_0-auc:0.76815
[350]	validation_0-auc:0.76821
[351]	validation_0-auc:0.76830
[352]	validation_0-auc:0.76835
[353]	validation_0-auc:0.76843
[354]	validation_0-auc:0.76845
[355]	validation_0-auc:0.76851
[356]	validation_0-auc:0.76855
[357]	validation_0-auc:0.76860
[358]	validation_0-auc:0.76864
[359]	validation_0-auc:0.76867
[360]	validation_0-auc:0.76872
[361]	validation_0-auc:0.76876
[362]	validation_0-auc:0.76879
[363]	validation_0-auc:0.76884
[364]	validation_0-auc:0.76886
[365]	validation_0-auc:0.76894
[366]	validation_0-auc:0.76898
[367]	validation_0-auc:0.76902
[368]	validation_0-auc:0.76905
[369]	validation_0-auc:0.76911
[370]	validation_0-auc:0.76913
[371]	validation_0-auc:0.76913
[372]	validation_0-auc:0.76919
[373]	validation_0-auc:0.76923
[374]	validation_0-auc:0.76928
[375]	validation_0-auc:0.76932
[376]	validation_0-auc:0.76938
[377]	validation_0-auc:0.76941
[378]	validation_0-auc:0.76951
[379]	validation_0-auc:0.76958
[380]	validation_0-auc:0.76960
[381]	validation_0-auc:0.76963
[382]	validation_0-auc:0.76968
[383]	validation_0-auc:0.76969
[384]	validation_0-auc:0.76974
[385]	validation_0-auc:0.76979
[386]	validation_0-auc:0.76981
[387]	validation_0-auc:0.76986
[388]	validation_0-auc:0.76992
[389]	validation_0-auc:0.77001
[390]	validation_0-auc:0.77006
[391]	validation_0-auc:0.77009
[392]	validation_0-auc:0.77013
[393]	validation_0-auc:0.77018
[394]	validation_0-auc:0.77019
[395]	validation_0-auc:0.77023
[396]	validation_0-auc:0.77026
[397]	validation_0-auc:0.77028
[398]	validation_0-auc:0.77031
[399]	validation_0-auc:0.77035
[400]	validation_0-auc:0.77036
[401]	validation_0-auc:0.77039
[402]	validation_0-auc:0.77043
[403]	validation_0-auc:0.77044
[404]	validation_0-auc:0.77046
[405]	validation_0-auc:0.77052
[406]	validation_0-auc:0.77057
[407]	validation_0-auc:0.77060
[408]	validation_0-auc:0.77061
[409]	validation_0-auc:0.77063
[410]	validation_0-auc:0.77067
[411]	validation_0-auc:0.77064
[412]	validation_0-auc:0.77072
[413]	validation_0-auc:0.77075
[414]	validation_0-auc:0.77078
[415]	validation_0-auc:0.77082
[416]	validation_0-auc:0.77085
[417]	validation_0-auc:0.77090
[418]	validation_0-auc:0.77096
[419]	validation_0-auc:0.77099
[420]	validation_0-auc:0.77104
[421]	validation_0-auc:0.77107
[422]	validation_0-auc:0.77111
[423]	validation_0-auc:0.77113
[424]	validation_0-auc:0.77120
[425]	validation_0-auc:0.77124
[426]	validation_0-auc:0.77127
[427]	validation_0-auc:0.77128
[428]	validation_0-auc:0.77129
[429]	validation_0-auc:0.77131
[430]	validation_0-auc:0.77134
[431]	validation_0-auc:0.77139
[432]	validation_0-auc:0.77141
[433]	validation_0-auc:0.77145
[434]	validation_0-auc:0.77147
[435]	validation_0-auc:0.77151
[436]	validation_0-auc:0.77153
[437]	validation_0-auc:0.77155
[438]	validation_0-auc:0.77153
[439]	validation_0-auc:0.77155
[440]	validation_0-auc:0.77155
[441]	validation_0-auc:0.77157
[442]	validation_0-auc:0.77162
[443]	validation_0-auc:0.77170
[444]	validation_0-auc:0.77170
[445]	validation_0-auc:0.77171
[446]	validation_0-auc:0.77173
[447]	validation_0-auc:0.77179
[448]	validation_0-auc:0.77180
[449]	validation_0-auc:0.77182
[450]	validation_0-auc:0.77183
[451]	validation_0-auc:0.77189
[452]	validation_0-auc:0.77190
[453]	validation_0-auc:0.77190
[454]	validation_0-auc:0.77191
[455]	validation_0-auc:0.77198
[456]	validation_0-auc:0.77200
[457]	validation_0-auc:0.77207
[458]	validation_0-auc:0.77212
[459]	validation_0-auc:0.77214
[460]	validation_0-auc:0.77216
[461]	validation_0-auc:0.77218
[462]	validation_0-auc:0.77222
[463]	validation_0-auc:0.77225
[464]	validation_0-auc:0.77224
[465]	validation_0-auc:0.77228
[466]	validation_0-auc:0.77235
[467]	validation_0-auc:0.77233
[468]	validation_0-auc:0.77232
[469]	validation_0-auc:0.77232
[470]	validation_0-auc:0.77238
[471]	validation_0-auc:0.77241
[472]	validation_0-auc:0.77242
[473]	validation_0-auc:0.77243
[474]	validation_0-auc:0.77247
[475]	validation_0-auc:0.77249
[476]	validation_0-auc:0.77251
[477]	validation_0-auc:0.77254
[478]	validation_0-auc:0.77257
[479]	validation_0-auc:0.77260
[480]	validation_0-auc:0.77266
[481]	validation_0-auc:0.77269
[482]	validation_0-auc:0.77272
[483]	validation_0-auc:0.77276
[484]	validation_0-auc:0.77277
[485]	validation_0-auc:0.77277
[486]	validation_0-auc:0.77282
[487]	validation_0-auc:0.77285
[488]	validation_0-auc:0.77287
[489]	validation_0-auc:0.77289
[490]	validation_0-auc:0.77287
[491]	validation_0-auc:0.77291
[492]	validation_0-auc:0.77297
[493]	validation_0-auc:0.77299
[494]	validation_0-auc:0.77300
[495]	validation_0-auc:0.77301
[496]	validation_0-auc:0.77303
[497]	validation_0-auc:0.77307
[498]	validation_0-auc:0.77308
[499]	validation_0-auc:0.77313
[500]	validation_0-auc:0.77314
[501]	validation_0-auc:0.77317
[502]	validation_0-auc:0.77319
[503]	validation_0-auc:0.77321
[504]	validation_0-auc:0.77322
[505]	validation_0-auc:0.77322
[506]	validation_0-auc:0.77321
[507]	validation_0-auc:0.77323
[508]	validation_0-auc:0.77325
[509]	validation_0-auc:0.77330
[510]	validation_0-auc:0.77332
[511]	validation_0-auc:0.77331
[512]	validation_0-auc:0.77332
[513]	validation_0-auc:0.77336
[514]	validation_0-auc:0.77340
[515]	validation_0-auc:0.77344
[516]	validation_0-auc:0.77347
[517]	validation_0-auc:0.77349
[518]	validation_0-auc:0.77357
[519]	validation_0-auc:0.77360
[520]	validation_0-auc:0.77365
[521]	validation_0-auc:0.77366
[522]	validation_0-auc:0.77369
[523]	validation_0-auc:0.77372
[524]	validation_0-auc:0.77376
[525]	validation_0-auc:0.77378
[526]	validation_0-auc:0.77380
[527]	validation_0-auc:0.77379
[528]	validation_0-auc:0.77380
[529]	validation_0-auc:0.77381
[530]	validation_0-auc:0.77387
[531]	validation_0-auc:0.77387
[532]	validation_0-auc:0.77387
[533]	validation_0-auc:0.77389
[534]	validation_0-auc:0.77390
[535]	validation_0-auc:0.77394
[536]	validation_0-auc:0.77398
[537]	validation_0-auc:0.77401
[538]	validation_0-auc:0.77403
[539]	validation_0-auc:0.77407
[540]	validation_0-auc:0.77406
[541]	validation_0-auc:0.77408
[542]	validation_0-auc:0.77411
[543]	validation_0-auc:0.77412
[544]	validation_0-auc:0.77413
[545]	validation_0-auc:0.77415
[546]	validation_0-auc:0.77416
[547]	validation_0-auc:0.77418
[548]	validation_0-auc:0.77418
[549]	validation_0-auc:0.77422
[550]	validation_0-auc:0.77423
[551]	validation_0-auc:0.77424
[552]	validation_0-auc:0.77428
[553]	validation_0-auc:0.77427
[554]	validation_0-auc:0.77429
[555]	validation_0-auc:0.77432
[556]	validation_0-auc:0.77433
[557]	validation_0-auc:0.77434
[558]	validation_0-auc:0.77437
[559]	validation_0-auc:0.77438
[560]	validation_0-auc:0.77440
[561]	validation_0-auc:0.77439
[562]	validation_0-auc:0.77441
[563]	validation_0-auc:0.77443
[564]	validation_0-auc:0.77448
[565]	validation_0-auc:0.77449
[566]	validation_0-auc:0.77453
[567]	validation_0-auc:0.77456
[568]	validation_0-auc:0.77456
[569]	validation_0-auc:0.77458
[570]	validation_0-auc:0.77465
[571]	validation_0-auc:0.77465
[572]	validation_0-auc:0.77467
[573]	validation_0-auc:0.77468
[574]	validation_0-auc:0.77470
[575]	validation_0-auc:0.77467
[576]	validation_0-auc:0.77466
[577]	validation_0-auc:0.77472
[578]	validation_0-auc:0.77473
[579]	validation_0-auc:0.77475
[580]	validation_0-auc:0.77475
[581]	validation_0-auc:0.77476
[582]	validation_0-auc:0.77478
[583]	validation_0-auc:0.77478
[584]	validation_0-auc:0.77480
[585]	validation_0-auc:0.77481
[586]	validation_0-auc:0.77484
[587]	validation_0-auc:0.77484
[588]	validation_0-auc:0.77488
[589]	validation_0-auc:0.77489
[590]	validation_0-auc:0.77492
[591]	validation_0-auc:0.77493
[592]	validation_0-auc:0.77494
[593]	validation_0-auc:0.77496
[594]	validation_0-auc:0.77500
[595]	validation_0-auc:0.77501
[596]	validation_0-auc:0.77501
[597]	validation_0-auc:0.77506
[598]	validation_0-auc:0.77505
[599]	validation_0-auc:0.77512
[600]	validation_0-auc:0.77512
[601]	validation_0-auc:0.77513
[602]	validation_0-auc:0.77519
[603]	validation_0-auc:0.77517
[604]	validation_0-auc:0.77519
[605]	validation_0-auc:0.77523
[606]	validation_0-auc:0.77523
[607]	validation_0-auc:0.77527
[608]	validation_0-auc:0.77527
[609]	validation_0-auc:0.77529
[610]	validation_0-auc:0.77529
[611]	validation_0-auc:0.77532
[612]	validation_0-auc:0.77531
[613]	validation_0-auc:0.77532
[614]	validation_0-auc:0.77532
[615]	validation_0-auc:0.77536
[616]	validation_0-auc:0.77538
[617]	validation_0-auc:0.77542
[618]	validation_0-auc:0.77543
[619]	validation_0-auc:0.77547
[620]	validation_0-auc:0.77547
[621]	validation_0-auc:0.77547
[622]	validation_0-auc:0.77552
[623]	validation_0-auc:0.77551
[624]	validation_0-auc:0.77554
[625]	validation_0-auc:0.77554
[626]	validation_0-auc:0.77555
[627]	validation_0-auc:0.77554
[628]	validation_0-auc:0.77556
[629]	validation_0-auc:0.77559
[630]	validation_0-auc:0.77565
[631]	validation_0-auc:0.77568
[632]	validation_0-auc:0.77568
[633]	validation_0-auc:0.77565
[634]	validation_0-auc:0.77566
[635]	validation_0-auc:0.77567
[636]	validation_0-auc:0.77568
[637]	validation_0-auc:0.77573
[638]	validation_0-auc:0.77575
[639]	validation_0-auc:0.77573
[640]	validation_0-auc:0.77575
[641]	validation_0-auc:0.77575
[642]	validation_0-auc:0.77576
[643]	validation_0-auc:0.77575
[644]	validation_0-auc:0.77578
[645]	validation_0-auc:0.77575
[646]	validation_0-auc:0.77577
[647]	validation_0-auc:0.77578
[648]	validation_0-auc:0.77577
[649]	validation_0-auc:0.77579
[650]	validation_0-auc:0.77579
[651]	validation_0-auc:0.77580
[652]	validation_0-auc:0.77582
[653]	validation_0-auc:0.77584
[654]	validation_0-auc:0.77586
[655]	validation_0-auc:0.77584
[656]	validation_0-auc:0.77587
[657]	validation_0-auc:0.77586
[658]	validation_0-auc:0.77587
[659]	validation_0-auc:0.77587
[660]	validation_0-auc:0.77589
[661]	validation_0-auc:0.77591
[662]	validation_0-auc:0.77594
[663]	validation_0-auc:0.77596
[664]	validation_0-auc:0.77598
[665]	validation_0-auc:0.77599
[666]	validation_0-auc:0.77600
[667]	validation_0-auc:0.77605
[668]	validation_0-auc:0.77604
[669]	validation_0-auc:0.77607
[670]	validation_0-auc:0.77606
[671]	validation_0-auc:0.77603
[672]	validation_0-auc:0.77606
[673]	validation_0-auc:0.77606
[674]	validation_0-auc:0.77607
[675]	validation_0-auc:0.77608
[676]	validation_0-auc:0.77612
[677]	validation_0-auc:0.77615
[678]	validation_0-auc:0.77615
[679]	validation_0-auc:0.77617
[680]	validation_0-auc:0.77620
[681]	validation_0-auc:0.77621
[682]	validation_0-auc:0.77620
[683]	validation_0-auc:0.77624
[684]	validation_0-auc:0.77626
[685]	validation_0-auc:0.77629
[686]	validation_0-auc:0.77631
[687]	validation_0-auc:0.77634
[688]	validation_0-auc:0.77638
[689]	validation_0-auc:0.77640
[690]	validation_0-auc:0.77641
[691]	validation_0-auc:0.77642
[692]	validation_0-auc:0.77642
[693]	validation_0-auc:0.77643
[694]	validation_0-auc:0.77645
[695]	validation_0-auc:0.77647
[696]	validation_0-auc:0.77649
[697]	validation_0-auc:0.77649
[698]	validation_0-auc:0.77651
[699]	validation_0-auc:0.77650
[700]	validation_0-auc:0.77650
[701]	validation_0-auc:0.77649
[702]	validation_0-auc:0.77646
[703]	validation_0-auc:0.77646
[704]	validation_0-auc:0.77653
[705]	validation_0-auc:0.77653
[706]	validation_0-auc:0.77653
[707]	validation_0-auc:0.77657
[708]	validation_0-auc:0.77657
[709]	validation_0-auc:0.77656
[710]	validation_0-auc:0.77653
[711]	validation_0-auc:0.77652
[712]	validation_0-auc:0.77653
[713]	validation_0-auc:0.77654
[714]	validation_0-auc:0.77657
[715]	validation_0-auc:0.77656
[716]	validation_0-auc:0.77658
[717]	validation_0-auc:0.77657
[718]	validation_0-auc:0.77658
[719]	validation_0-auc:0.77660
[720]	validation_0-auc:0.77660
[721]	validation_0-auc:0.77661
[722]	validation_0-auc:0.77663
[723]	validation_0-auc:0.77665
[724]	validation_0-auc:0.77667
[725]	validation_0-auc:0.77669
[726]	validation_0-auc:0.77670
[727]	validation_0-auc:0.77670
[728]	validation_0-auc:0.77670
[729]	validation_0-auc:0.77672
[730]	validation_0-auc:0.77671
[731]	validation_0-auc:0.77673
[732]	validation_0-auc:0.77677
[733]	validation_0-auc:0.77679
[734]	validation_0-auc:0.77677
[735]	validation_0-auc:0.77678
[736]	validation_0-auc:0.77681
[737]	validation_0-auc:0.77683
[738]	validation_0-auc:0.77686
[739]	validation_0-auc:0.77687
[740]	validation_0-auc:0.77690
[741]	validation_0-auc:0.77690
[742]	validation_0-auc:0.77690
[743]	validation_0-auc:0.77693
[744]	validation_0-auc:0.77694
[745]	validation_0-auc:0.77697
[746]	validation_0-auc:0.77698
[747]	validation_0-auc:0.77698
[748]	validation_0-auc:0.77698
[749]	validation_0-auc:0.77699
[750]	validation_0-auc:0.77705
[751]	validation_0-auc:0.77708
[752]	validation_0-auc:0.77711
[753]	validation_0-auc:0.77711
[754]	validation_0-auc:0.77713
[755]	validation_0-auc:0.77712
[756]	validation_0-auc:0.77712
[757]	validation_0-auc:0.77712
[758]	validation_0-auc:0.77712
[759]	validation_0-auc:0.77712
[760]	validation_0-auc:0.77716
[761]	validation_0-auc:0.77716
[762]	validation_0-auc:0.77716
[763]	validation_0-auc:0.77717
[764]	validation_0-auc:0.77717
[765]	validation_0-auc:0.77720
[766]	validation_0-auc:0.77720
[767]	validation_0-auc:0.77719
[768]	validation_0-auc:0.77721
[769]	validation_0-auc:0.77721
[770]	validation_0-auc:0.77721
[771]	validation_0-auc:0.77724
[772]	validation_0-auc:0.77726
[773]	validation_0-auc:0.77727
[774]	validation_0-auc:0.77729
[775]	validation_0-auc:0.77729
[776]	validation_0-auc:0.77730
[777]	validation_0-auc:0.77732
[778]	validation_0-auc:0.77730
[779]	validation_0-auc:0.77730
[780]	validation_0-auc:0.77731
[781]	validation_0-auc:0.77733
[782]	validation_0-auc:0.77734
[783]	validation_0-auc:0.77736
[784]	validation_0-auc:0.77737
[785]	validation_0-auc:0.77740
[786]	validation_0-auc:0.77740
[787]	validation_0-auc:0.77741
[788]	validation_0-auc:0.77742
[789]	validation_0-auc:0.77741
[790]	validation_0-auc:0.77742
[791]	validation_0-auc:0.77743
[792]	validation_0-auc:0.77744
[793]	validation_0-auc:0.77744
[794]	validation_0-auc:0.77745
[795]	validation_0-auc:0.77743
[796]	validation_0-auc:0.77746
[797]	validation_0-auc:0.77745
[798]	validation_0-auc:0.77745
[799]	validation_0-auc:0.77746
[800]	validation_0-auc:0.77746
[801]	validation_0-auc:0.77747
[802]	validation_0-auc:0.77746
[803]	validation_0-auc:0.77749
[804]	validation_0-auc:0.77750
[805]	validation_0-auc:0.77751
[806]	validation_0-auc:0.77752
[807]	validation_0-auc:0.77752
[808]	validation_0-auc:0.77751
[809]	validation_0-auc:0.77752
[810]	validation_0-auc:0.77752
[811]	validation_0-auc:0.77754
[812]	validation_0-auc:0.77757
[813]	validation_0-auc:0.77758
[814]	validation_0-auc:0.77757
[815]	validation_0-auc:0.77757
[816]	validation_0-auc:0.77761
[817]	validation_0-auc:0.77762
[818]	validation_0-auc:0.77763
[819]	validation_0-auc:0.77763
[820]	validation_0-auc:0.77764
[821]	validation_0-auc:0.77767
[822]	validation_0-auc:0.77766
[823]	validation_0-auc:0.77767
[824]	validation_0-auc:0.77767
[825]	validation_0-auc:0.77768
[826]	validation_0-auc:0.77769
[827]	validation_0-auc:0.77771
[828]	validation_0-auc:0.77773
[829]	validation_0-auc:0.77773
[830]	validation_0-auc:0.77774
[831]	validation_0-auc:0.77773
[832]	validation_0-auc:0.77775
[833]	validation_0-auc:0.77775
[834]	validation_0-auc:0.77778
[835]	validation_0-auc:0.77780
[836]	validation_0-auc:0.77779
[837]	validation_0-auc:0.77779
[838]	validation_0-auc:0.77781
[839]	validation_0-auc:0.77781
[840]	validation_0-auc:0.77780
[841]	validation_0-auc:0.77777
[842]	validation_0-auc:0.77777
[843]	validation_0-auc:0.77779
[844]	validation_0-auc:0.77779
[845]	validation_0-auc:0.77779
[846]	validation_0-auc:0.77777
[847]	validation_0-auc:0.77780
[848]	validation_0-auc:0.77781
[849]	validation_0-auc:0.77782
[850]	validation_0-auc:0.77781
[851]	validation_0-auc:0.77781
[852]	validation_0-auc:0.77780
[853]	validation_0-auc:0.77781
[854]	validation_0-auc:0.77780
[855]	validation_0-auc:0.77782
[856]	validation_0-auc:0.77783
[857]	validation_0-auc:0.77785
[858]	validation_0-auc:0.77788
[859]	validation_0-auc:0.77789
[860]	validation_0-auc:0.77789
[861]	validation_0-auc:0.77790
[862]	validation_0-auc:0.77790
[863]	validation_0-auc:0.77790
[864]	validation_0-auc:0.77791
[865]	validation_0-auc:0.77792
[866]	validation_0-auc:0.77794
[867]	validation_0-auc:0.77794
[868]	validation_0-auc:0.77795
[869]	validation_0-auc:0.77795
[870]	validation_0-auc:0.77797
[871]	validation_0-auc:0.77795
[872]	validation_0-auc:0.77796
[873]	validation_0-auc:0.77797
[874]	validation_0-auc:0.77798
[875]	validation_0-auc:0.77798
[876]	validation_0-auc:0.77798
[877]	validation_0-auc:0.77798
[878]	validation_0-auc:0.77800
[879]	validation_0-auc:0.77801
[880]	validation_0-auc:0.77803
[881]	validation_0-auc:0.77802
[882]	validation_0-auc:0.77804
[883]	validation_0-auc:0.77803
[884]	validation_0-auc:0.77804
[885]	validation_0-auc:0.77803
[886]	validation_0-auc:0.77802
[887]	validation_0-auc:0.77803
[888]	validation_0-auc:0.77803
[889]	validation_0-auc:0.77806
[890]	validation_0-auc:0.77804
[891]	validation_0-auc:0.77808
[892]	validation_0-auc:0.77807
[893]	validation_0-auc:0.77807
[894]	validation_0-auc:0.77808
[895]	validation_0-auc:0.77808
[896]	validation_0-auc:0.77808
[897]	validation_0-auc:0.77809
[898]	validation_0-auc:0.77808
[899]	validation_0-auc:0.77809
[900]	validation_0-auc:0.77810
[901]	validation_0-auc:0.77810
[902]	validation_0-auc:0.77810
[903]	validation_0-auc:0.77809
[904]	validation_0-auc:0.77813
[905]	validation_0-auc:0.77815
[906]	validation_0-auc:0.77816
[907]	validation_0-auc:0.77816
[908]	validation_0-auc:0.77818
[909]	validation_0-auc:0.77818
[910]	validation_0-auc:0.77817
[911]	validation_0-auc:0.77816
[912]	validation_0-auc:0.77816
[913]	validation_0-auc:0.77819
[914]	validation_0-auc:0.77819
[915]	validation_0-auc:0.77821
[916]	validation_0-auc:0.77822
[917]	validation_0-auc:0.77822
[918]	validation_0-auc:0.77822
[919]	validation_0-auc:0.77823
[920]	validation_0-auc:0.77827
[921]	validation_0-auc:0.77828
[922]	validation_0-auc:0.77829
[923]	validation_0-auc:0.77831
[924]	validation_0-auc:0.77829
[925]	validation_0-auc:0.77829
[926]	validation_0-auc:0.77831
[927]	validation_0-auc:0.77830
[928]	validation_0-auc:0.77833
[929]	validation_0-auc:0.77834
[930]	validation_0-auc:0.77834
[931]	validation_0-auc:0.77833
[932]	validation_0-auc:0.77833
[933]	validation_0-auc:0.77834
[934]	validation_0-auc:0.77832
[935]	validation_0-auc:0.77833
[936]	validation_0-auc:0.77834
[937]	validation_0-auc:0.77835
[938]	validation_0-auc:0.77836
[939]	validation_0-auc:0.77835
[940]	validation_0-auc:0.77836
[941]	validation_0-auc:0.77838
[942]	validation_0-auc:0.77838
[943]	validation_0-auc:0.77840
[944]	validation_0-auc:0.77841
[945]	validation_0-auc:0.77841
[946]	validation_0-auc:0.77842
[947]	validation_0-auc:0.77842
[948]	validation_0-auc:0.77843
[949]	validation_0-auc:0.77844
[950]	validation_0-auc:0.77846
[951]	validation_0-auc:0.77849
[952]	validation_0-auc:0.77849
[953]	validation_0-auc:0.77848
[954]	validation_0-auc:0.77851
[955]	validation_0-auc:0.77851
[956]	validation_0-auc:0.77852
[957]	validation_0-auc:0.77852
[958]	validation_0-auc:0.77853
[959]	validation_0-auc:0.77848
[960]	validation_0-auc:0.77849
[961]	validation_0-auc:0.77848
[962]	validation_0-auc:0.77848
[963]	validation_0-auc:0.77850
[964]	validation_0-auc:0.77850
[965]	validation_0-auc:0.77853
[966]	validation_0-auc:0.77853
[967]	validation_0-auc:0.77853
[968]	validation_0-auc:0.77854
[969]	validation_0-auc:0.77855
[970]	validation_0-auc:0.77857
[971]	validation_0-auc:0.77858
[972]	validation_0-auc:0.77858
[973]	validation_0-auc:0.77858
[974]	validation_0-auc:0.77860
[975]	validation_0-auc:0.77859
[976]	validation_0-auc:0.77858
[977]	validation_0-auc:0.77858
[978]	validation_0-auc:0.77858
[979]	validation_0-auc:0.77859
[980]	validation_0-auc:0.77860
[981]	validation_0-auc:0.77860
[982]	validation_0-auc:0.77859
[983]	validation_0-auc:0.77862
[984]	validation_0-auc:0.77863
[985]	validation_0-auc:0.77862
[986]	validation_0-auc:0.77862
[987]	validation_0-auc:0.77860
[988]	validation_0-auc:0.77860
[989]	validation_0-auc:0.77860
[990]	validation_0-auc:0.77858
[991]	validation_0-auc:0.77858
[992]	validation_0-auc:0.77860
[993]	validation_0-auc:0.77861
[994]	validation_0-auc:0.77862
[995]	validation_0-auc:0.77864
[996]	validation_0-auc:0.77863
[997]	validation_0-auc:0.77863
[998]	validation_0-auc:0.77863
[999]	validation_0-auc:0.77864
****************************************


----------------------------------------


****************************************


Best Parameters:
	predictor__colsample_bytree: 0.8
	predictor__eta: 0.01
	predictor__max_depth: 8
	predictor__min_child_weight: 35
	predictor__n_estimators: 1000
	predictor__verbose_eval: False
	predictor__verbosity: 0
****** FINISH Best Model Manual Balanced: XGBoost  *****

----------------------------------------
****************************************
----------------------------------------
****** START Best Model Manual Balanced: LightGBM *****
Parameters:
	boosting_type: ['gbdt']
	colsample_bytree: [0.8, 0.95]
	is_unbalance: [True]
	learning_rate: [0.01, 0.05, 0.1]
	min_split_gain: [0.01, 0.02, 0.05]
	n_estimators: [5000, 10000]
	num_leaves: [30, 35]
	reg_alpha: [0.01, 0.02, 0.05]
	reg_lambda: [0.01, 0.02, 0.05]
Fitting 5 folds for each of 648 candidates, totalling 3240 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   33.3s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:  2.6min
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:  5.8min
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed:  9.2min
[Parallel(n_jobs=-1)]: Done 1242 tasks      | elapsed: 12.7min
[Parallel(n_jobs=-1)]: Done 1792 tasks      | elapsed: 18.0min
[Parallel(n_jobs=-1)]: Done 2442 tasks      | elapsed: 25.7min
[Parallel(n_jobs=-1)]: Done 3192 tasks      | elapsed: 31.9min
[Parallel(n_jobs=-1)]: Done 3240 out of 3240 | elapsed: 32.2min finished
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[52]	valid's auc: 0.743711	valid's binary_logloss: 0.534737
****************************************


----------------------------------------


****************************************


Best Parameters:
	predictor__boosting_type: gbdt
	predictor__colsample_bytree: 0.8
	predictor__is_unbalance: True
	predictor__learning_rate: 0.01
	predictor__min_split_gain: 0.05
	predictor__n_estimators: 5000
	predictor__num_leaves: 35
	predictor__reg_alpha: 0.01
	predictor__reg_lambda: 0.05
****** FINISH Best Model Manual Balanced: LightGBM  *****

----------------------------------------
****************************************
----------------------------------------
Model Train Accuracy Score Train AUC Score Test Accuracy Score Test AUC Score Training Time Experiments Description
0 Best Model Manual Balanced:XGBoost 0.840 0.893 0.781 0.779 75.7185 360 [["predictor__colsample_bytree", 0.8], ["predi...
1 Best Model Manual Balanced:LightGBM 0.972 0.998 0.756 0.771 32.2073 3240 [["predictor__boosting_type", "gbdt"], ["predi...
CPU times: user 14min 49s, sys: 16.2 s, total: 15min 5s
Wall time: 2h 40min 46s

Manually Balanced Data Set (Ratio = 0.33)

In [22]:
df_balanced = manual_balancing(df_train, ratio=0.33)
X_bal = df_balanced.drop('TARGET', axis = 1).values
y_bal = df_balanced['TARGET'].values
df_balanced['TARGET'].value_counts(normalize = True).round(2).plot.pie(autopct='%1.1f%%')
Out[22]:
<AxesSubplot:ylabel='TARGET'>
In [23]:
X_train, X_test, y_train, y_test = train_test_split(X_bal, y_bal , stratify = y_bal, 
                                                    test_size = 0.3, random_state = 42)
In [24]:
%%time
# This might take a while
if __name__ == "__main__":

    ConductGridSearch(X_train, y_train, X_test, y_test, 0, "Best Model Manual Balanced 50%:",  n_jobs=-1,verbose=1)
****** START Best Model Manual Balanced 50%: XGBoost *****
Parameters:
	colsample_bytree: [0.8, 0.95]
	eta: [0.01, 0.05, 0.1]
	max_depth: [3, 6, 8]
	min_child_weight: [20, 35]
	n_estimators: [500, 1000]
	verbose_eval: [False]
	verbosity: [0]
Fitting 5 folds for each of 72 candidates, totalling 360 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 6 concurrent workers.
[Parallel(n_jobs=-1)]: Done  38 tasks      | elapsed: 19.8min
[Parallel(n_jobs=-1)]: Done 188 tasks      | elapsed: 65.2min
[Parallel(n_jobs=-1)]: Done 360 out of 360 | elapsed: 131.6min finished
[0]	validation_0-auc:0.70888
Will train until validation_0-auc hasn't improved in 50 rounds.
[1]	validation_0-auc:0.71108
[2]	validation_0-auc:0.71552
[3]	validation_0-auc:0.71775
[4]	validation_0-auc:0.71947
[5]	validation_0-auc:0.72097
[6]	validation_0-auc:0.72196
[7]	validation_0-auc:0.72189
[8]	validation_0-auc:0.72287
[9]	validation_0-auc:0.72345
[10]	validation_0-auc:0.72386
[11]	validation_0-auc:0.72432
[12]	validation_0-auc:0.72474
[13]	validation_0-auc:0.72497
[14]	validation_0-auc:0.72613
[15]	validation_0-auc:0.72624
[16]	validation_0-auc:0.72737
[17]	validation_0-auc:0.72791
[18]	validation_0-auc:0.72822
[19]	validation_0-auc:0.72894
[20]	validation_0-auc:0.73035
[21]	validation_0-auc:0.73092
[22]	validation_0-auc:0.73103
[23]	validation_0-auc:0.73125
[24]	validation_0-auc:0.73182
[25]	validation_0-auc:0.73196
[26]	validation_0-auc:0.73245
[27]	validation_0-auc:0.73273
[28]	validation_0-auc:0.73311
[29]	validation_0-auc:0.73342
[30]	validation_0-auc:0.73397
[31]	validation_0-auc:0.73424
[32]	validation_0-auc:0.73471
[33]	validation_0-auc:0.73496
[34]	validation_0-auc:0.73603
[35]	validation_0-auc:0.73663
[36]	validation_0-auc:0.73721
[37]	validation_0-auc:0.73806
[38]	validation_0-auc:0.73830
[39]	validation_0-auc:0.73926
[40]	validation_0-auc:0.73970
[41]	validation_0-auc:0.74025
[42]	validation_0-auc:0.74079
[43]	validation_0-auc:0.74142
[44]	validation_0-auc:0.74186
[45]	validation_0-auc:0.74256
[46]	validation_0-auc:0.74327
[47]	validation_0-auc:0.74406
[48]	validation_0-auc:0.74452
[49]	validation_0-auc:0.74497
[50]	validation_0-auc:0.74546
[51]	validation_0-auc:0.74623
[52]	validation_0-auc:0.74669
[53]	validation_0-auc:0.74730
[54]	validation_0-auc:0.74768
[55]	validation_0-auc:0.74818
[56]	validation_0-auc:0.74869
[57]	validation_0-auc:0.74897
[58]	validation_0-auc:0.74951
[59]	validation_0-auc:0.74986
[60]	validation_0-auc:0.75010
[61]	validation_0-auc:0.75061
[62]	validation_0-auc:0.75107
[63]	validation_0-auc:0.75116
[64]	validation_0-auc:0.75148
[65]	validation_0-auc:0.75184
[66]	validation_0-auc:0.75214
[67]	validation_0-auc:0.75253
[68]	validation_0-auc:0.75291
[69]	validation_0-auc:0.75336
[70]	validation_0-auc:0.75390
[71]	validation_0-auc:0.75421
[72]	validation_0-auc:0.75458
[73]	validation_0-auc:0.75485
[74]	validation_0-auc:0.75534
[75]	validation_0-auc:0.75542
[76]	validation_0-auc:0.75563
[77]	validation_0-auc:0.75601
[78]	validation_0-auc:0.75637
[79]	validation_0-auc:0.75651
[80]	validation_0-auc:0.75681
[81]	validation_0-auc:0.75702
[82]	validation_0-auc:0.75725
[83]	validation_0-auc:0.75753
[84]	validation_0-auc:0.75768
[85]	validation_0-auc:0.75785
[86]	validation_0-auc:0.75811
[87]	validation_0-auc:0.75828
[88]	validation_0-auc:0.75853
[89]	validation_0-auc:0.75872
[90]	validation_0-auc:0.75887
[91]	validation_0-auc:0.75903
[92]	validation_0-auc:0.75931
[93]	validation_0-auc:0.75942
[94]	validation_0-auc:0.75969
[95]	validation_0-auc:0.75984
[96]	validation_0-auc:0.76003
[97]	validation_0-auc:0.76027
[98]	validation_0-auc:0.76052
[99]	validation_0-auc:0.76069
[100]	validation_0-auc:0.76084
[101]	validation_0-auc:0.76097
[102]	validation_0-auc:0.76111
[103]	validation_0-auc:0.76125
[104]	validation_0-auc:0.76136
[105]	validation_0-auc:0.76146
[106]	validation_0-auc:0.76161
[107]	validation_0-auc:0.76180
[108]	validation_0-auc:0.76184
[109]	validation_0-auc:0.76194
[110]	validation_0-auc:0.76209
[111]	validation_0-auc:0.76217
[112]	validation_0-auc:0.76228
[113]	validation_0-auc:0.76243
[114]	validation_0-auc:0.76262
[115]	validation_0-auc:0.76283
[116]	validation_0-auc:0.76301
[117]	validation_0-auc:0.76311
[118]	validation_0-auc:0.76320
[119]	validation_0-auc:0.76330
[120]	validation_0-auc:0.76344
[121]	validation_0-auc:0.76357
[122]	validation_0-auc:0.76377
[123]	validation_0-auc:0.76384
[124]	validation_0-auc:0.76411
[125]	validation_0-auc:0.76413
[126]	validation_0-auc:0.76428
[127]	validation_0-auc:0.76440
[128]	validation_0-auc:0.76448
[129]	validation_0-auc:0.76464
[130]	validation_0-auc:0.76470
[131]	validation_0-auc:0.76478
[132]	validation_0-auc:0.76485
[133]	validation_0-auc:0.76492
[134]	validation_0-auc:0.76507
[135]	validation_0-auc:0.76519
[136]	validation_0-auc:0.76521
[137]	validation_0-auc:0.76535
[138]	validation_0-auc:0.76543
[139]	validation_0-auc:0.76547
[140]	validation_0-auc:0.76555
[141]	validation_0-auc:0.76558
[142]	validation_0-auc:0.76572
[143]	validation_0-auc:0.76589
[144]	validation_0-auc:0.76597
[145]	validation_0-auc:0.76601
[146]	validation_0-auc:0.76611
[147]	validation_0-auc:0.76618
[148]	validation_0-auc:0.76621
[149]	validation_0-auc:0.76630
[150]	validation_0-auc:0.76644
[151]	validation_0-auc:0.76646
[152]	validation_0-auc:0.76655
[153]	validation_0-auc:0.76664
[154]	validation_0-auc:0.76672
[155]	validation_0-auc:0.76680
[156]	validation_0-auc:0.76692
[157]	validation_0-auc:0.76692
[158]	validation_0-auc:0.76702
[159]	validation_0-auc:0.76705
[160]	validation_0-auc:0.76713
[161]	validation_0-auc:0.76721
[162]	validation_0-auc:0.76732
[163]	validation_0-auc:0.76736
[164]	validation_0-auc:0.76753
[165]	validation_0-auc:0.76754
[166]	validation_0-auc:0.76763
[167]	validation_0-auc:0.76775
[168]	validation_0-auc:0.76779
[169]	validation_0-auc:0.76786
[170]	validation_0-auc:0.76795
[171]	validation_0-auc:0.76805
[172]	validation_0-auc:0.76813
[173]	validation_0-auc:0.76827
[174]	validation_0-auc:0.76835
[175]	validation_0-auc:0.76839
[176]	validation_0-auc:0.76842
[177]	validation_0-auc:0.76844
[178]	validation_0-auc:0.76855
[179]	validation_0-auc:0.76855
[180]	validation_0-auc:0.76857
[181]	validation_0-auc:0.76860
[182]	validation_0-auc:0.76882
[183]	validation_0-auc:0.76899
[184]	validation_0-auc:0.76905
[185]	validation_0-auc:0.76911
[186]	validation_0-auc:0.76913
[187]	validation_0-auc:0.76930
[188]	validation_0-auc:0.76932
[189]	validation_0-auc:0.76940
[190]	validation_0-auc:0.76946
[191]	validation_0-auc:0.76950
[192]	validation_0-auc:0.76948
[193]	validation_0-auc:0.76954
[194]	validation_0-auc:0.76953
[195]	validation_0-auc:0.76962
[196]	validation_0-auc:0.76965
[197]	validation_0-auc:0.76970
[198]	validation_0-auc:0.76979
[199]	validation_0-auc:0.76977
[200]	validation_0-auc:0.76983
[201]	validation_0-auc:0.76997
[202]	validation_0-auc:0.76999
[203]	validation_0-auc:0.77002
[204]	validation_0-auc:0.77014
[205]	validation_0-auc:0.77016
[206]	validation_0-auc:0.77014
[207]	validation_0-auc:0.77023
[208]	validation_0-auc:0.77039
[209]	validation_0-auc:0.77056
[210]	validation_0-auc:0.77059
[211]	validation_0-auc:0.77063
[212]	validation_0-auc:0.77074
[213]	validation_0-auc:0.77082
[214]	validation_0-auc:0.77094
[215]	validation_0-auc:0.77102
[216]	validation_0-auc:0.77100
[217]	validation_0-auc:0.77109
[218]	validation_0-auc:0.77113
[219]	validation_0-auc:0.77114
[220]	validation_0-auc:0.77122
[221]	validation_0-auc:0.77124
[222]	validation_0-auc:0.77141
[223]	validation_0-auc:0.77142
[224]	validation_0-auc:0.77145
[225]	validation_0-auc:0.77157
[226]	validation_0-auc:0.77158
[227]	validation_0-auc:0.77167
[228]	validation_0-auc:0.77165
[229]	validation_0-auc:0.77167
[230]	validation_0-auc:0.77178
[231]	validation_0-auc:0.77179
[232]	validation_0-auc:0.77179
[233]	validation_0-auc:0.77192
[234]	validation_0-auc:0.77199
[235]	validation_0-auc:0.77207
[236]	validation_0-auc:0.77226
[237]	validation_0-auc:0.77231
[238]	validation_0-auc:0.77236
[239]	validation_0-auc:0.77239
[240]	validation_0-auc:0.77239
[241]	validation_0-auc:0.77244
[242]	validation_0-auc:0.77247
[243]	validation_0-auc:0.77254
[244]	validation_0-auc:0.77247
[245]	validation_0-auc:0.77252
[246]	validation_0-auc:0.77252
[247]	validation_0-auc:0.77261
[248]	validation_0-auc:0.77267
[249]	validation_0-auc:0.77270
[250]	validation_0-auc:0.77273
[251]	validation_0-auc:0.77276
[252]	validation_0-auc:0.77276
[253]	validation_0-auc:0.77286
[254]	validation_0-auc:0.77296
[255]	validation_0-auc:0.77301
[256]	validation_0-auc:0.77300
[257]	validation_0-auc:0.77298
[258]	validation_0-auc:0.77314
[259]	validation_0-auc:0.77315
[260]	validation_0-auc:0.77314
[261]	validation_0-auc:0.77313
[262]	validation_0-auc:0.77319
[263]	validation_0-auc:0.77321
[264]	validation_0-auc:0.77323
[265]	validation_0-auc:0.77325
[266]	validation_0-auc:0.77330
[267]	validation_0-auc:0.77337
[268]	validation_0-auc:0.77363
[269]	validation_0-auc:0.77373
[270]	validation_0-auc:0.77376
[271]	validation_0-auc:0.77375
[272]	validation_0-auc:0.77375
[273]	validation_0-auc:0.77380
[274]	validation_0-auc:0.77384
[275]	validation_0-auc:0.77390
[276]	validation_0-auc:0.77396
[277]	validation_0-auc:0.77394
[278]	validation_0-auc:0.77397
[279]	validation_0-auc:0.77402
[280]	validation_0-auc:0.77417
[281]	validation_0-auc:0.77419
[282]	validation_0-auc:0.77420
[283]	validation_0-auc:0.77421
[284]	validation_0-auc:0.77421
[285]	validation_0-auc:0.77419
[286]	validation_0-auc:0.77426
[287]	validation_0-auc:0.77427
[288]	validation_0-auc:0.77428
[289]	validation_0-auc:0.77426
[290]	validation_0-auc:0.77422
[291]	validation_0-auc:0.77422
[292]	validation_0-auc:0.77430
[293]	validation_0-auc:0.77434
[294]	validation_0-auc:0.77435
[295]	validation_0-auc:0.77436
[296]	validation_0-auc:0.77433
[297]	validation_0-auc:0.77435
[298]	validation_0-auc:0.77446
[299]	validation_0-auc:0.77447
[300]	validation_0-auc:0.77450
[301]	validation_0-auc:0.77456
[302]	validation_0-auc:0.77457
[303]	validation_0-auc:0.77455
[304]	validation_0-auc:0.77456
[305]	validation_0-auc:0.77465
[306]	validation_0-auc:0.77467
[307]	validation_0-auc:0.77468
[308]	validation_0-auc:0.77471
[309]	validation_0-auc:0.77474
[310]	validation_0-auc:0.77479
[311]	validation_0-auc:0.77476
[312]	validation_0-auc:0.77484
[313]	validation_0-auc:0.77485
[314]	validation_0-auc:0.77488
[315]	validation_0-auc:0.77489
[316]	validation_0-auc:0.77481
[317]	validation_0-auc:0.77481
[318]	validation_0-auc:0.77483
[319]	validation_0-auc:0.77480
[320]	validation_0-auc:0.77483
[321]	validation_0-auc:0.77484
[322]	validation_0-auc:0.77482
[323]	validation_0-auc:0.77483
[324]	validation_0-auc:0.77484
[325]	validation_0-auc:0.77481
[326]	validation_0-auc:0.77481
[327]	validation_0-auc:0.77480
[328]	validation_0-auc:0.77482
[329]	validation_0-auc:0.77482
[330]	validation_0-auc:0.77487
[331]	validation_0-auc:0.77486
[332]	validation_0-auc:0.77488
[333]	validation_0-auc:0.77490
[334]	validation_0-auc:0.77488
[335]	validation_0-auc:0.77480
[336]	validation_0-auc:0.77479
[337]	validation_0-auc:0.77488
[338]	validation_0-auc:0.77487
[339]	validation_0-auc:0.77486
[340]	validation_0-auc:0.77496
[341]	validation_0-auc:0.77500
[342]	validation_0-auc:0.77498
[343]	validation_0-auc:0.77499
[344]	validation_0-auc:0.77502
[345]	validation_0-auc:0.77499
[346]	validation_0-auc:0.77496
[347]	validation_0-auc:0.77506
[348]	validation_0-auc:0.77506
[349]	validation_0-auc:0.77508
[350]	validation_0-auc:0.77507
[351]	validation_0-auc:0.77510
[352]	validation_0-auc:0.77507
[353]	validation_0-auc:0.77507
[354]	validation_0-auc:0.77514
[355]	validation_0-auc:0.77516
[356]	validation_0-auc:0.77518
[357]	validation_0-auc:0.77515
[358]	validation_0-auc:0.77516
[359]	validation_0-auc:0.77521
[360]	validation_0-auc:0.77517
[361]	validation_0-auc:0.77526
[362]	validation_0-auc:0.77525
[363]	validation_0-auc:0.77516
[364]	validation_0-auc:0.77517
[365]	validation_0-auc:0.77519
[366]	validation_0-auc:0.77524
[367]	validation_0-auc:0.77529
[368]	validation_0-auc:0.77528
[369]	validation_0-auc:0.77529
[370]	validation_0-auc:0.77523
[371]	validation_0-auc:0.77524
[372]	validation_0-auc:0.77522
[373]	validation_0-auc:0.77520
[374]	validation_0-auc:0.77523
[375]	validation_0-auc:0.77525
[376]	validation_0-auc:0.77536
[377]	validation_0-auc:0.77537
[378]	validation_0-auc:0.77539
[379]	validation_0-auc:0.77544
[380]	validation_0-auc:0.77546
[381]	validation_0-auc:0.77549
[382]	validation_0-auc:0.77551
[383]	validation_0-auc:0.77547
[384]	validation_0-auc:0.77546
[385]	validation_0-auc:0.77556
[386]	validation_0-auc:0.77561
[387]	validation_0-auc:0.77565
[388]	validation_0-auc:0.77561
[389]	validation_0-auc:0.77556
[390]	validation_0-auc:0.77554
[391]	validation_0-auc:0.77555
[392]	validation_0-auc:0.77559
[393]	validation_0-auc:0.77557
[394]	validation_0-auc:0.77563
[395]	validation_0-auc:0.77568
[396]	validation_0-auc:0.77567
[397]	validation_0-auc:0.77567
[398]	validation_0-auc:0.77564
[399]	validation_0-auc:0.77564
[400]	validation_0-auc:0.77567
[401]	validation_0-auc:0.77564
[402]	validation_0-auc:0.77568
[403]	validation_0-auc:0.77569
[404]	validation_0-auc:0.77576
[405]	validation_0-auc:0.77580
[406]	validation_0-auc:0.77576
[407]	validation_0-auc:0.77577
[408]	validation_0-auc:0.77580
[409]	validation_0-auc:0.77579
[410]	validation_0-auc:0.77577
[411]	validation_0-auc:0.77576
[412]	validation_0-auc:0.77575
[413]	validation_0-auc:0.77578
[414]	validation_0-auc:0.77581
[415]	validation_0-auc:0.77585
[416]	validation_0-auc:0.77585
[417]	validation_0-auc:0.77589
[418]	validation_0-auc:0.77593
[419]	validation_0-auc:0.77591
[420]	validation_0-auc:0.77590
[421]	validation_0-auc:0.77589
[422]	validation_0-auc:0.77584
[423]	validation_0-auc:0.77586
[424]	validation_0-auc:0.77590
[425]	validation_0-auc:0.77587
[426]	validation_0-auc:0.77596
[427]	validation_0-auc:0.77602
[428]	validation_0-auc:0.77607
[429]	validation_0-auc:0.77607
[430]	validation_0-auc:0.77612
[431]	validation_0-auc:0.77613
[432]	validation_0-auc:0.77607
[433]	validation_0-auc:0.77608
[434]	validation_0-auc:0.77604
[435]	validation_0-auc:0.77601
[436]	validation_0-auc:0.77600
[437]	validation_0-auc:0.77598
[438]	validation_0-auc:0.77598
[439]	validation_0-auc:0.77592
[440]	validation_0-auc:0.77596
[441]	validation_0-auc:0.77599
[442]	validation_0-auc:0.77606
[443]	validation_0-auc:0.77604
[444]	validation_0-auc:0.77607
[445]	validation_0-auc:0.77607
[446]	validation_0-auc:0.77611
[447]	validation_0-auc:0.77614
[448]	validation_0-auc:0.77614
[449]	validation_0-auc:0.77615
[450]	validation_0-auc:0.77616
[451]	validation_0-auc:0.77613
[452]	validation_0-auc:0.77614
[453]	validation_0-auc:0.77610
[454]	validation_0-auc:0.77607
[455]	validation_0-auc:0.77604
[456]	validation_0-auc:0.77604
[457]	validation_0-auc:0.77601
[458]	validation_0-auc:0.77605
[459]	validation_0-auc:0.77603
[460]	validation_0-auc:0.77609
[461]	validation_0-auc:0.77609
[462]	validation_0-auc:0.77609
[463]	validation_0-auc:0.77608
[464]	validation_0-auc:0.77606
[465]	validation_0-auc:0.77607
[466]	validation_0-auc:0.77605
[467]	validation_0-auc:0.77604
[468]	validation_0-auc:0.77604
[469]	validation_0-auc:0.77601
[470]	validation_0-auc:0.77608
[471]	validation_0-auc:0.77606
[472]	validation_0-auc:0.77614
[473]	validation_0-auc:0.77617
[474]	validation_0-auc:0.77613
[475]	validation_0-auc:0.77607
[476]	validation_0-auc:0.77616
[477]	validation_0-auc:0.77613
[478]	validation_0-auc:0.77616
[479]	validation_0-auc:0.77617
[480]	validation_0-auc:0.77616
[481]	validation_0-auc:0.77614
[482]	validation_0-auc:0.77611
[483]	validation_0-auc:0.77610
[484]	validation_0-auc:0.77615
[485]	validation_0-auc:0.77610
[486]	validation_0-auc:0.77609
[487]	validation_0-auc:0.77608
[488]	validation_0-auc:0.77615
[489]	validation_0-auc:0.77619
[490]	validation_0-auc:0.77621
[491]	validation_0-auc:0.77618
[492]	validation_0-auc:0.77618
[493]	validation_0-auc:0.77613
[494]	validation_0-auc:0.77612
[495]	validation_0-auc:0.77615
[496]	validation_0-auc:0.77615
[497]	validation_0-auc:0.77613
[498]	validation_0-auc:0.77614
[499]	validation_0-auc:0.77614
[500]	validation_0-auc:0.77612
[501]	validation_0-auc:0.77618
[502]	validation_0-auc:0.77616
[503]	validation_0-auc:0.77616
[504]	validation_0-auc:0.77616
[505]	validation_0-auc:0.77619
[506]	validation_0-auc:0.77623
[507]	validation_0-auc:0.77626
[508]	validation_0-auc:0.77631
[509]	validation_0-auc:0.77634
[510]	validation_0-auc:0.77637
[511]	validation_0-auc:0.77638
[512]	validation_0-auc:0.77637
[513]	validation_0-auc:0.77636
[514]	validation_0-auc:0.77636
[515]	validation_0-auc:0.77634
[516]	validation_0-auc:0.77637
[517]	validation_0-auc:0.77635
[518]	validation_0-auc:0.77636
[519]	validation_0-auc:0.77638
[520]	validation_0-auc:0.77637
[521]	validation_0-auc:0.77635
[522]	validation_0-auc:0.77636
[523]	validation_0-auc:0.77637
[524]	validation_0-auc:0.77634
[525]	validation_0-auc:0.77633
[526]	validation_0-auc:0.77629
[527]	validation_0-auc:0.77628
[528]	validation_0-auc:0.77631
[529]	validation_0-auc:0.77629
[530]	validation_0-auc:0.77631
[531]	validation_0-auc:0.77621
[532]	validation_0-auc:0.77621
[533]	validation_0-auc:0.77621
[534]	validation_0-auc:0.77616
[535]	validation_0-auc:0.77617
[536]	validation_0-auc:0.77618
[537]	validation_0-auc:0.77619
[538]	validation_0-auc:0.77619
[539]	validation_0-auc:0.77619
[540]	validation_0-auc:0.77620
[541]	validation_0-auc:0.77624
[542]	validation_0-auc:0.77625
[543]	validation_0-auc:0.77626
[544]	validation_0-auc:0.77625
[545]	validation_0-auc:0.77624
[546]	validation_0-auc:0.77626
[547]	validation_0-auc:0.77627
[548]	validation_0-auc:0.77627
[549]	validation_0-auc:0.77622
[550]	validation_0-auc:0.77623
[551]	validation_0-auc:0.77625
[552]	validation_0-auc:0.77624
[553]	validation_0-auc:0.77616
[554]	validation_0-auc:0.77615
[555]	validation_0-auc:0.77613
[556]	validation_0-auc:0.77611
[557]	validation_0-auc:0.77616
[558]	validation_0-auc:0.77615
[559]	validation_0-auc:0.77614
[560]	validation_0-auc:0.77613
[561]	validation_0-auc:0.77616
Stopping. Best iteration:
[511]	validation_0-auc:0.77638

****************************************


----------------------------------------


****************************************


Best Parameters:
	predictor__colsample_bytree: 0.95
	predictor__eta: 0.05
	predictor__max_depth: 3
	predictor__min_child_weight: 35
	predictor__n_estimators: 1000
	predictor__verbose_eval: False
	predictor__verbosity: 0
****** FINISH Best Model Manual Balanced 50%: XGBoost  *****

----------------------------------------
****************************************
----------------------------------------
****** START Best Model Manual Balanced 50%: LightGBM *****
Parameters:
	boosting_type: ['gbdt']
	colsample_bytree: [0.8, 0.95]
	is_unbalance: [True]
	learning_rate: [0.01, 0.05, 0.1]
	min_split_gain: [0.01, 0.02, 0.05]
	n_estimators: [5000, 10000]
	num_leaves: [30, 35]
	reg_alpha: [0.01, 0.02, 0.05]
	reg_lambda: [0.01, 0.02, 0.05]
Fitting 5 folds for each of 648 candidates, totalling 3240 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 6 concurrent workers.
[Parallel(n_jobs=-1)]: Done  38 tasks      | elapsed:  5.5min
[Parallel(n_jobs=-1)]: Done 188 tasks      | elapsed: 26.2min
[Parallel(n_jobs=-1)]: Done 438 tasks      | elapsed: 61.3min
[Parallel(n_jobs=-1)]: Done 788 tasks      | elapsed: 83.8min
[Parallel(n_jobs=-1)]: Done 1238 tasks      | elapsed: 97.1min
[Parallel(n_jobs=-1)]: Done 1788 tasks      | elapsed: 130.8min
[Parallel(n_jobs=-1)]: Done 2438 tasks      | elapsed: 197.1min
[Parallel(n_jobs=-1)]: Done 3188 tasks      | elapsed: 218.6min
[Parallel(n_jobs=-1)]: Done 3240 out of 3240 | elapsed: 219.8min finished
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[961]	valid's auc: 0.775426	valid's binary_logloss: 0.553734
****************************************


----------------------------------------


****************************************


Best Parameters:
	predictor__boosting_type: gbdt
	predictor__colsample_bytree: 0.8
	predictor__is_unbalance: True
	predictor__learning_rate: 0.01
	predictor__min_split_gain: 0.05
	predictor__n_estimators: 5000
	predictor__num_leaves: 35
	predictor__reg_alpha: 0.02
	predictor__reg_lambda: 0.02
****** FINISH Best Model Manual Balanced 50%: LightGBM  *****

----------------------------------------
****************************************
----------------------------------------
Model Train Accuracy Score Train AUC Score Test Accuracy Score Test AUC Score Training Time Experiments Description
0 Best Model Manual Balanced 50%:XGBoost 0.774 0.825 0.736 0.770 35.3433 360 [["predictor__colsample_bytree", 0.95], ["pred...
1 Best Model Manual Balanced 50%:LightGBM 0.986 0.999 0.727 0.765 56.6256 3240 [["predictor__boosting_type", "gbdt"], ["predi...
CPU times: user 15min 24s, sys: 7.52 s, total: 15min 32s
Wall time: 5h 53min 57s

Manually Balanced Data Set (Ratio = 0.5 or 1:1)

In [25]:
df_balanced = manual_balancing(df_train, ratio=0.5)
X_bal = df_balanced.drop('TARGET', axis = 1).values
y_bal = df_balanced['TARGET'].values
df_balanced['TARGET'].value_counts(normalize = True).round(2).plot.pie(autopct='%1.1f%%')
Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8175e25080>
In [26]:
X_train, X_test, y_train, y_test = train_test_split(X_bal, y_bal , stratify = y_bal, 
                                                    test_size = 0.3, random_state = 42)
In [27]:
%%time
# This might take a while
if __name__ == "__main__":

    ConductGridSearch(X_train, y_train, X_test, y_test, 0, "Best Model Manual Balanced 1:1",  n_jobs=-1,verbose=1)
    
****** START Best Model Manual Balanced 1:1 XGBoost *****
Parameters:
	colsample_bytree: [0.8, 0.95]
	eta: [0.01, 0.05, 0.1]
	max_depth: [3, 6, 8]
	min_child_weight: [20, 35]
	n_estimators: [500, 1000]
	verbose_eval: [False]
	verbosity: [0]
Fitting 5 folds for each of 72 candidates, totalling 360 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed: 10.1min
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed: 33.3min
[0]	validation_0-auc:0.70487
Will train until validation_0-auc hasn't improved in 50 rounds.
[Parallel(n_jobs=-1)]: Done 360 out of 360 | elapsed: 68.9min finished
[1]	validation_0-auc:0.70982
[2]	validation_0-auc:0.71155
[3]	validation_0-auc:0.71178
[4]	validation_0-auc:0.71204
[5]	validation_0-auc:0.71383
[6]	validation_0-auc:0.71535
[7]	validation_0-auc:0.71676
[8]	validation_0-auc:0.71684
[9]	validation_0-auc:0.71704
[10]	validation_0-auc:0.71689
[11]	validation_0-auc:0.71749
[12]	validation_0-auc:0.71739
[13]	validation_0-auc:0.71805
[14]	validation_0-auc:0.71856
[15]	validation_0-auc:0.71949
[16]	validation_0-auc:0.71954
[17]	validation_0-auc:0.72023
[18]	validation_0-auc:0.72099
[19]	validation_0-auc:0.72088
[20]	validation_0-auc:0.72163
[21]	validation_0-auc:0.72194
[22]	validation_0-auc:0.72239
[23]	validation_0-auc:0.72273
[24]	validation_0-auc:0.72356
[25]	validation_0-auc:0.72377
[26]	validation_0-auc:0.72446
[27]	validation_0-auc:0.72501
[28]	validation_0-auc:0.72569
[29]	validation_0-auc:0.72620
[30]	validation_0-auc:0.72659
[31]	validation_0-auc:0.72692
[32]	validation_0-auc:0.72785
[33]	validation_0-auc:0.72889
[34]	validation_0-auc:0.72960
[35]	validation_0-auc:0.72996
[36]	validation_0-auc:0.73084
[37]	validation_0-auc:0.73150
[38]	validation_0-auc:0.73214
[39]	validation_0-auc:0.73268
[40]	validation_0-auc:0.73311
[41]	validation_0-auc:0.73348
[42]	validation_0-auc:0.73400
[43]	validation_0-auc:0.73471
[44]	validation_0-auc:0.73534
[45]	validation_0-auc:0.73627
[46]	validation_0-auc:0.73672
[47]	validation_0-auc:0.73727
[48]	validation_0-auc:0.73764
[49]	validation_0-auc:0.73817
[50]	validation_0-auc:0.73892
[51]	validation_0-auc:0.73969
[52]	validation_0-auc:0.73996
[53]	validation_0-auc:0.74073
[54]	validation_0-auc:0.74123
[55]	validation_0-auc:0.74149
[56]	validation_0-auc:0.74179
[57]	validation_0-auc:0.74243
[58]	validation_0-auc:0.74328
[59]	validation_0-auc:0.74346
[60]	validation_0-auc:0.74412
[61]	validation_0-auc:0.74441
[62]	validation_0-auc:0.74475
[63]	validation_0-auc:0.74541
[64]	validation_0-auc:0.74579
[65]	validation_0-auc:0.74633
[66]	validation_0-auc:0.74666
[67]	validation_0-auc:0.74683
[68]	validation_0-auc:0.74725
[69]	validation_0-auc:0.74753
[70]	validation_0-auc:0.74781
[71]	validation_0-auc:0.74803
[72]	validation_0-auc:0.74833
[73]	validation_0-auc:0.74857
[74]	validation_0-auc:0.74888
[75]	validation_0-auc:0.74918
[76]	validation_0-auc:0.74944
[77]	validation_0-auc:0.74965
[78]	validation_0-auc:0.74992
[79]	validation_0-auc:0.75004
[80]	validation_0-auc:0.75043
[81]	validation_0-auc:0.75081
[82]	validation_0-auc:0.75103
[83]	validation_0-auc:0.75137
[84]	validation_0-auc:0.75163
[85]	validation_0-auc:0.75190
[86]	validation_0-auc:0.75214
[87]	validation_0-auc:0.75249
[88]	validation_0-auc:0.75274
[89]	validation_0-auc:0.75308
[90]	validation_0-auc:0.75340
[91]	validation_0-auc:0.75347
[92]	validation_0-auc:0.75377
[93]	validation_0-auc:0.75382
[94]	validation_0-auc:0.75394
[95]	validation_0-auc:0.75404
[96]	validation_0-auc:0.75434
[97]	validation_0-auc:0.75463
[98]	validation_0-auc:0.75491
[99]	validation_0-auc:0.75498
[100]	validation_0-auc:0.75507
[101]	validation_0-auc:0.75545
[102]	validation_0-auc:0.75555
[103]	validation_0-auc:0.75579
[104]	validation_0-auc:0.75582
[105]	validation_0-auc:0.75596
[106]	validation_0-auc:0.75604
[107]	validation_0-auc:0.75632
[108]	validation_0-auc:0.75652
[109]	validation_0-auc:0.75685
[110]	validation_0-auc:0.75694
[111]	validation_0-auc:0.75703
[112]	validation_0-auc:0.75723
[113]	validation_0-auc:0.75748
[114]	validation_0-auc:0.75782
[115]	validation_0-auc:0.75794
[116]	validation_0-auc:0.75803
[117]	validation_0-auc:0.75802
[118]	validation_0-auc:0.75814
[119]	validation_0-auc:0.75840
[120]	validation_0-auc:0.75836
[121]	validation_0-auc:0.75854
[122]	validation_0-auc:0.75876
[123]	validation_0-auc:0.75876
[124]	validation_0-auc:0.75898
[125]	validation_0-auc:0.75896
[126]	validation_0-auc:0.75911
[127]	validation_0-auc:0.75919
[128]	validation_0-auc:0.75929
[129]	validation_0-auc:0.75945
[130]	validation_0-auc:0.75938
[131]	validation_0-auc:0.75953
[132]	validation_0-auc:0.75948
[133]	validation_0-auc:0.75954
[134]	validation_0-auc:0.75954
[135]	validation_0-auc:0.75964
[136]	validation_0-auc:0.75969
[137]	validation_0-auc:0.75967
[138]	validation_0-auc:0.75981
[139]	validation_0-auc:0.75989
[140]	validation_0-auc:0.76001
[141]	validation_0-auc:0.76013
[142]	validation_0-auc:0.76026
[143]	validation_0-auc:0.76031
[144]	validation_0-auc:0.76040
[145]	validation_0-auc:0.76041
[146]	validation_0-auc:0.76056
[147]	validation_0-auc:0.76050
[148]	validation_0-auc:0.76055
[149]	validation_0-auc:0.76057
[150]	validation_0-auc:0.76056
[151]	validation_0-auc:0.76061
[152]	validation_0-auc:0.76068
[153]	validation_0-auc:0.76071
[154]	validation_0-auc:0.76079
[155]	validation_0-auc:0.76094
[156]	validation_0-auc:0.76100
[157]	validation_0-auc:0.76110
[158]	validation_0-auc:0.76124
[159]	validation_0-auc:0.76127
[160]	validation_0-auc:0.76116
[161]	validation_0-auc:0.76132
[162]	validation_0-auc:0.76130
[163]	validation_0-auc:0.76150
[164]	validation_0-auc:0.76149
[165]	validation_0-auc:0.76156
[166]	validation_0-auc:0.76152
[167]	validation_0-auc:0.76156
[168]	validation_0-auc:0.76165
[169]	validation_0-auc:0.76176
[170]	validation_0-auc:0.76175
[171]	validation_0-auc:0.76175
[172]	validation_0-auc:0.76181
[173]	validation_0-auc:0.76199
[174]	validation_0-auc:0.76198
[175]	validation_0-auc:0.76209
[176]	validation_0-auc:0.76216
[177]	validation_0-auc:0.76223
[178]	validation_0-auc:0.76224
[179]	validation_0-auc:0.76239
[180]	validation_0-auc:0.76250
[181]	validation_0-auc:0.76262
[182]	validation_0-auc:0.76270
[183]	validation_0-auc:0.76275
[184]	validation_0-auc:0.76289
[185]	validation_0-auc:0.76301
[186]	validation_0-auc:0.76310
[187]	validation_0-auc:0.76309
[188]	validation_0-auc:0.76318
[189]	validation_0-auc:0.76323
[190]	validation_0-auc:0.76318
[191]	validation_0-auc:0.76331
[192]	validation_0-auc:0.76329
[193]	validation_0-auc:0.76334
[194]	validation_0-auc:0.76336
[195]	validation_0-auc:0.76335
[196]	validation_0-auc:0.76334
[197]	validation_0-auc:0.76338
[198]	validation_0-auc:0.76349
[199]	validation_0-auc:0.76359
[200]	validation_0-auc:0.76354
[201]	validation_0-auc:0.76352
[202]	validation_0-auc:0.76357
[203]	validation_0-auc:0.76358
[204]	validation_0-auc:0.76344
[205]	validation_0-auc:0.76348
[206]	validation_0-auc:0.76351
[207]	validation_0-auc:0.76352
[208]	validation_0-auc:0.76355
[209]	validation_0-auc:0.76363
[210]	validation_0-auc:0.76369
[211]	validation_0-auc:0.76380
[212]	validation_0-auc:0.76385
[213]	validation_0-auc:0.76388
[214]	validation_0-auc:0.76385
[215]	validation_0-auc:0.76392
[216]	validation_0-auc:0.76403
[217]	validation_0-auc:0.76391
[218]	validation_0-auc:0.76392
[219]	validation_0-auc:0.76390
[220]	validation_0-auc:0.76395
[221]	validation_0-auc:0.76399
[222]	validation_0-auc:0.76409
[223]	validation_0-auc:0.76407
[224]	validation_0-auc:0.76409
[225]	validation_0-auc:0.76409
[226]	validation_0-auc:0.76401
[227]	validation_0-auc:0.76397
[228]	validation_0-auc:0.76400
[229]	validation_0-auc:0.76419
[230]	validation_0-auc:0.76425
[231]	validation_0-auc:0.76426
[232]	validation_0-auc:0.76431
[233]	validation_0-auc:0.76436
[234]	validation_0-auc:0.76437
[235]	validation_0-auc:0.76439
[236]	validation_0-auc:0.76442
[237]	validation_0-auc:0.76442
[238]	validation_0-auc:0.76449
[239]	validation_0-auc:0.76458
[240]	validation_0-auc:0.76455
[241]	validation_0-auc:0.76450
[242]	validation_0-auc:0.76455
[243]	validation_0-auc:0.76451
[244]	validation_0-auc:0.76446
[245]	validation_0-auc:0.76449
[246]	validation_0-auc:0.76450
[247]	validation_0-auc:0.76459
[248]	validation_0-auc:0.76466
[249]	validation_0-auc:0.76475
[250]	validation_0-auc:0.76471
[251]	validation_0-auc:0.76467
[252]	validation_0-auc:0.76472
[253]	validation_0-auc:0.76468
[254]	validation_0-auc:0.76463
[255]	validation_0-auc:0.76463
[256]	validation_0-auc:0.76468
[257]	validation_0-auc:0.76467
[258]	validation_0-auc:0.76466
[259]	validation_0-auc:0.76460
[260]	validation_0-auc:0.76459
[261]	validation_0-auc:0.76459
[262]	validation_0-auc:0.76469
[263]	validation_0-auc:0.76475
[264]	validation_0-auc:0.76486
[265]	validation_0-auc:0.76496
[266]	validation_0-auc:0.76508
[267]	validation_0-auc:0.76515
[268]	validation_0-auc:0.76521
[269]	validation_0-auc:0.76521
[270]	validation_0-auc:0.76526
[271]	validation_0-auc:0.76542
[272]	validation_0-auc:0.76541
[273]	validation_0-auc:0.76554
[274]	validation_0-auc:0.76559
[275]	validation_0-auc:0.76558
[276]	validation_0-auc:0.76561
[277]	validation_0-auc:0.76560
[278]	validation_0-auc:0.76563
[279]	validation_0-auc:0.76552
[280]	validation_0-auc:0.76556
[281]	validation_0-auc:0.76560
[282]	validation_0-auc:0.76558
[283]	validation_0-auc:0.76570
[284]	validation_0-auc:0.76564
[285]	validation_0-auc:0.76561
[286]	validation_0-auc:0.76561
[287]	validation_0-auc:0.76563
[288]	validation_0-auc:0.76560
[289]	validation_0-auc:0.76553
[290]	validation_0-auc:0.76562
[291]	validation_0-auc:0.76566
[292]	validation_0-auc:0.76577
[293]	validation_0-auc:0.76578
[294]	validation_0-auc:0.76583
[295]	validation_0-auc:0.76583
[296]	validation_0-auc:0.76581
[297]	validation_0-auc:0.76568
[298]	validation_0-auc:0.76567
[299]	validation_0-auc:0.76574
[300]	validation_0-auc:0.76570
[301]	validation_0-auc:0.76569
[302]	validation_0-auc:0.76574
[303]	validation_0-auc:0.76572
[304]	validation_0-auc:0.76573
[305]	validation_0-auc:0.76574
[306]	validation_0-auc:0.76573
[307]	validation_0-auc:0.76564
[308]	validation_0-auc:0.76572
[309]	validation_0-auc:0.76579
[310]	validation_0-auc:0.76576
[311]	validation_0-auc:0.76570
[312]	validation_0-auc:0.76572
[313]	validation_0-auc:0.76582
[314]	validation_0-auc:0.76586
[315]	validation_0-auc:0.76598
[316]	validation_0-auc:0.76582
[317]	validation_0-auc:0.76581
[318]	validation_0-auc:0.76582
[319]	validation_0-auc:0.76583
[320]	validation_0-auc:0.76585
[321]	validation_0-auc:0.76596
[322]	validation_0-auc:0.76588
[323]	validation_0-auc:0.76595
[324]	validation_0-auc:0.76598
[325]	validation_0-auc:0.76596
[326]	validation_0-auc:0.76592
[327]	validation_0-auc:0.76590
[328]	validation_0-auc:0.76592
[329]	validation_0-auc:0.76610
[330]	validation_0-auc:0.76613
[331]	validation_0-auc:0.76619
[332]	validation_0-auc:0.76623
[333]	validation_0-auc:0.76626
[334]	validation_0-auc:0.76630
[335]	validation_0-auc:0.76631
[336]	validation_0-auc:0.76632
[337]	validation_0-auc:0.76633
[338]	validation_0-auc:0.76643
[339]	validation_0-auc:0.76637
[340]	validation_0-auc:0.76638
[341]	validation_0-auc:0.76643
[342]	validation_0-auc:0.76642
[343]	validation_0-auc:0.76638
[344]	validation_0-auc:0.76635
[345]	validation_0-auc:0.76641
[346]	validation_0-auc:0.76641
[347]	validation_0-auc:0.76644
[348]	validation_0-auc:0.76640
[349]	validation_0-auc:0.76649
[350]	validation_0-auc:0.76644
[351]	validation_0-auc:0.76631
[352]	validation_0-auc:0.76625
[353]	validation_0-auc:0.76618
[354]	validation_0-auc:0.76630
[355]	validation_0-auc:0.76616
[356]	validation_0-auc:0.76622
[357]	validation_0-auc:0.76625
[358]	validation_0-auc:0.76628
[359]	validation_0-auc:0.76633
[360]	validation_0-auc:0.76631
[361]	validation_0-auc:0.76636
[362]	validation_0-auc:0.76639
[363]	validation_0-auc:0.76641
[364]	validation_0-auc:0.76651
[365]	validation_0-auc:0.76652
[366]	validation_0-auc:0.76656
[367]	validation_0-auc:0.76662
[368]	validation_0-auc:0.76657
[369]	validation_0-auc:0.76653
[370]	validation_0-auc:0.76658
[371]	validation_0-auc:0.76674
[372]	validation_0-auc:0.76676
[373]	validation_0-auc:0.76674
[374]	validation_0-auc:0.76675
[375]	validation_0-auc:0.76665
[376]	validation_0-auc:0.76658
[377]	validation_0-auc:0.76649
[378]	validation_0-auc:0.76655
[379]	validation_0-auc:0.76648
[380]	validation_0-auc:0.76654
[381]	validation_0-auc:0.76658
[382]	validation_0-auc:0.76674
[383]	validation_0-auc:0.76670
[384]	validation_0-auc:0.76664
[385]	validation_0-auc:0.76659
[386]	validation_0-auc:0.76665
[387]	validation_0-auc:0.76667
[388]	validation_0-auc:0.76672
[389]	validation_0-auc:0.76682
[390]	validation_0-auc:0.76688
[391]	validation_0-auc:0.76697
[392]	validation_0-auc:0.76711
[393]	validation_0-auc:0.76712
[394]	validation_0-auc:0.76711
[395]	validation_0-auc:0.76711
[396]	validation_0-auc:0.76711
[397]	validation_0-auc:0.76707
[398]	validation_0-auc:0.76704
[399]	validation_0-auc:0.76708
[400]	validation_0-auc:0.76709
[401]	validation_0-auc:0.76710
[402]	validation_0-auc:0.76711
[403]	validation_0-auc:0.76701
[404]	validation_0-auc:0.76710
[405]	validation_0-auc:0.76707
[406]	validation_0-auc:0.76708
[407]	validation_0-auc:0.76707
[408]	validation_0-auc:0.76699
[409]	validation_0-auc:0.76695
[410]	validation_0-auc:0.76694
[411]	validation_0-auc:0.76702
[412]	validation_0-auc:0.76703
[413]	validation_0-auc:0.76694
[414]	validation_0-auc:0.76697
[415]	validation_0-auc:0.76697
[416]	validation_0-auc:0.76710
[417]	validation_0-auc:0.76715
[418]	validation_0-auc:0.76721
[419]	validation_0-auc:0.76714
[420]	validation_0-auc:0.76710
[421]	validation_0-auc:0.76709
[422]	validation_0-auc:0.76714
[423]	validation_0-auc:0.76708
[424]	validation_0-auc:0.76713
[425]	validation_0-auc:0.76716
[426]	validation_0-auc:0.76722
[427]	validation_0-auc:0.76720
[428]	validation_0-auc:0.76717
[429]	validation_0-auc:0.76700
[430]	validation_0-auc:0.76701
[431]	validation_0-auc:0.76698
[432]	validation_0-auc:0.76699
[433]	validation_0-auc:0.76700
[434]	validation_0-auc:0.76702
[435]	validation_0-auc:0.76704
[436]	validation_0-auc:0.76711
[437]	validation_0-auc:0.76719
[438]	validation_0-auc:0.76714
[439]	validation_0-auc:0.76721
[440]	validation_0-auc:0.76728
[441]	validation_0-auc:0.76723
[442]	validation_0-auc:0.76731
[443]	validation_0-auc:0.76734
[444]	validation_0-auc:0.76733
[445]	validation_0-auc:0.76740
[446]	validation_0-auc:0.76746
[447]	validation_0-auc:0.76746
[448]	validation_0-auc:0.76750
[449]	validation_0-auc:0.76747
[450]	validation_0-auc:0.76745
[451]	validation_0-auc:0.76745
[452]	validation_0-auc:0.76742
[453]	validation_0-auc:0.76741
[454]	validation_0-auc:0.76739
[455]	validation_0-auc:0.76738
[456]	validation_0-auc:0.76737
[457]	validation_0-auc:0.76738
[458]	validation_0-auc:0.76735
[459]	validation_0-auc:0.76735
[460]	validation_0-auc:0.76733
[461]	validation_0-auc:0.76730
[462]	validation_0-auc:0.76731
[463]	validation_0-auc:0.76727
[464]	validation_0-auc:0.76729
[465]	validation_0-auc:0.76728
[466]	validation_0-auc:0.76719
[467]	validation_0-auc:0.76714
[468]	validation_0-auc:0.76714
[469]	validation_0-auc:0.76708
[470]	validation_0-auc:0.76705
[471]	validation_0-auc:0.76710
[472]	validation_0-auc:0.76709
[473]	validation_0-auc:0.76712
[474]	validation_0-auc:0.76704
[475]	validation_0-auc:0.76707
[476]	validation_0-auc:0.76709
[477]	validation_0-auc:0.76708
[478]	validation_0-auc:0.76705
[479]	validation_0-auc:0.76712
[480]	validation_0-auc:0.76712
[481]	validation_0-auc:0.76710
[482]	validation_0-auc:0.76701
[483]	validation_0-auc:0.76701
[484]	validation_0-auc:0.76700
[485]	validation_0-auc:0.76701
[486]	validation_0-auc:0.76701
[487]	validation_0-auc:0.76705
[488]	validation_0-auc:0.76710
[489]	validation_0-auc:0.76713
[490]	validation_0-auc:0.76718
[491]	validation_0-auc:0.76726
[492]	validation_0-auc:0.76718
[493]	validation_0-auc:0.76724
[494]	validation_0-auc:0.76723
[495]	validation_0-auc:0.76722
[496]	validation_0-auc:0.76724
[497]	validation_0-auc:0.76721
[498]	validation_0-auc:0.76716
Stopping. Best iteration:
[448]	validation_0-auc:0.76750

****************************************


----------------------------------------


****************************************


Best Parameters:
	predictor__colsample_bytree: 0.8
	predictor__eta: 0.05
	predictor__max_depth: 3
	predictor__min_child_weight: 20
	predictor__n_estimators: 500
	predictor__verbose_eval: False
	predictor__verbosity: 0
****** FINISH Best Model Manual Balanced 1:1 XGBoost  *****

----------------------------------------
****************************************
----------------------------------------
****** START Best Model Manual Balanced 1:1 LightGBM *****
Parameters:
	boosting_type: ['gbdt']
	colsample_bytree: [0.8, 0.95]
	is_unbalance: [True]
	learning_rate: [0.01, 0.05, 0.1]
	min_split_gain: [0.01, 0.02, 0.05]
	n_estimators: [5000, 10000]
	num_leaves: [30, 35]
	reg_alpha: [0.01, 0.02, 0.05]
	reg_lambda: [0.01, 0.02, 0.05]
Fitting 5 folds for each of 648 candidates, totalling 3240 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:  2.2min
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed: 12.0min
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed: 29.1min
[Parallel(n_jobs=-1)]: Done 784 tasks      | elapsed: 40.7min
[Parallel(n_jobs=-1)]: Done 1234 tasks      | elapsed: 47.9min
[Parallel(n_jobs=-1)]: Done 1784 tasks      | elapsed: 64.8min
[Parallel(n_jobs=-1)]: Done 2434 tasks      | elapsed: 99.9min
[Parallel(n_jobs=-1)]: Done 3184 tasks      | elapsed: 111.7min
[Parallel(n_jobs=-1)]: Done 3240 out of 3240 | elapsed: 112.3min finished
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[781]	valid's auc: 0.764741	valid's binary_logloss: 0.579647
****************************************


----------------------------------------


****************************************


Best Parameters:
	predictor__boosting_type: gbdt
	predictor__colsample_bytree: 0.8
	predictor__is_unbalance: True
	predictor__learning_rate: 0.01
	predictor__min_split_gain: 0.02
	predictor__n_estimators: 5000
	predictor__num_leaves: 30
	predictor__reg_alpha: 0.02
	predictor__reg_lambda: 0.02
****** FINISH Best Model Manual Balanced 1:1 LightGBM  *****

----------------------------------------
****************************************
----------------------------------------
Model Train Accuracy Score Train AUC Score Test Accuracy Score Test AUC Score Training Time Experiments Description
0 Best Model Manual Balanced 1:1XGBoost 0.747 0.83 0.702 0.772 10.8852 360 [["predictor__colsample_bytree", 0.8], ["predi...
1 Best Model Manual Balanced 1:1LightGBM 0.994 1.00 0.699 0.770 42.8156 3240 [["predictor__boosting_type", "gbdt"], ["predi...
CPU times: user 10min 49s, sys: 22.5 s, total: 11min 12s
Wall time: 3h 2min 34s

Nuemrically Scaled Dataset - Unbalanced

In [99]:
X = df_train_reduced.drop('TARGET', axis = 1).values
y = df_train_reduced['TARGET'].values

X_train, X_test, y_train, y_test = train_test_split(X, y , stratify = y, 
                                                    test_size = 0.3, random_state = 42)
In [100]:
%%time

from sklearn.preprocessing import StandardScaler

# This might take a while
if __name__ == "__main__":
  
    Scaled_ConductGridSearch(X_train, y_train, X_test, y_test, 0, "Best Model Scaled:",  n_jobs=-1,verbose=1)
****** START Best Model Scaled: XGBoost *****
Parameters:
	predictor__colsample_bytree: [0.8, 0.95]
	predictor__eta: [0.01, 0.05, 0.1]
	predictor__max_depth: [3, 6, 8]
	predictor__min_child_weight: [20, 35]
	predictor__n_estimators: [500, 1000]
	predictor__verbose_eval: [False]
	predictor__verbosity: [0]
Fitting 5 folds for each of 72 candidates, totalling 360 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed: 11.1min
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed: 51.0min
[Parallel(n_jobs=-1)]: Done 360 out of 360 | elapsed: 120.5min finished
Columns split completed............
Scaling completed............
[0]	validation_0-auc:0.65198
Will train until validation_0-auc hasn't improved in 50 rounds.
[1]	validation_0-auc:0.66776
[2]	validation_0-auc:0.67375
[3]	validation_0-auc:0.67013
[4]	validation_0-auc:0.66529
[5]	validation_0-auc:0.66358
[6]	validation_0-auc:0.65722
[7]	validation_0-auc:0.66327
[8]	validation_0-auc:0.66863
[9]	validation_0-auc:0.67231
[10]	validation_0-auc:0.68103
[11]	validation_0-auc:0.67981
[12]	validation_0-auc:0.68333
[13]	validation_0-auc:0.68192
[14]	validation_0-auc:0.68305
[15]	validation_0-auc:0.68335
[16]	validation_0-auc:0.68151
[17]	validation_0-auc:0.68060
[18]	validation_0-auc:0.68094
[19]	validation_0-auc:0.68196
[20]	validation_0-auc:0.68249
[21]	validation_0-auc:0.67791
[22]	validation_0-auc:0.67448
[23]	validation_0-auc:0.67470
[24]	validation_0-auc:0.67041
[25]	validation_0-auc:0.66851
[26]	validation_0-auc:0.66455
[27]	validation_0-auc:0.66508
[28]	validation_0-auc:0.65714
[29]	validation_0-auc:0.65784
[30]	validation_0-auc:0.65590
[31]	validation_0-auc:0.66327
[32]	validation_0-auc:0.66487
[33]	validation_0-auc:0.67392
[34]	validation_0-auc:0.67203
[35]	validation_0-auc:0.67485
[36]	validation_0-auc:0.67196
[37]	validation_0-auc:0.66903
[38]	validation_0-auc:0.67097
[39]	validation_0-auc:0.66850
[40]	validation_0-auc:0.66803
[41]	validation_0-auc:0.66667
[42]	validation_0-auc:0.67138
[43]	validation_0-auc:0.67528
[44]	validation_0-auc:0.67453
[45]	validation_0-auc:0.67417
[46]	validation_0-auc:0.67623
[47]	validation_0-auc:0.67426
[48]	validation_0-auc:0.67423
[49]	validation_0-auc:0.67333
[50]	validation_0-auc:0.67333
[51]	validation_0-auc:0.67232
[52]	validation_0-auc:0.67267
[53]	validation_0-auc:0.67455
[54]	validation_0-auc:0.67244
[55]	validation_0-auc:0.67244
[56]	validation_0-auc:0.67248
[57]	validation_0-auc:0.67368
[58]	validation_0-auc:0.67111
[59]	validation_0-auc:0.67074
[60]	validation_0-auc:0.67072
[61]	validation_0-auc:0.67209
[62]	validation_0-auc:0.67142
[63]	validation_0-auc:0.66998
[64]	validation_0-auc:0.67030
[65]	validation_0-auc:0.66890
Stopping. Best iteration:
[15]	validation_0-auc:0.68335

Columns split completed............
Scaling completed............
Scaling completed............
Scaling completed............
Scaling completed............
Scaling completed............
****************************************


Scaling completed............
----------------------------------------


Scaling completed............
****************************************


Best Parameters:
	predictor__colsample_bytree: 0.8
	predictor__eta: 0.1
	predictor__max_depth: 8
	predictor__min_child_weight: 35
	predictor__n_estimators: 500
	predictor__verbose_eval: False
	predictor__verbosity: 0
****** FINISH Best Model Scaled: XGBoost  *****

----------------------------------------
****************************************
----------------------------------------
****** START Best Model Scaled: LightGBM *****
Parameters:
	predictor__boosting_type: ['gbdt']
	predictor__colsample_bytree: [0.8, 0.95]
	predictor__is_unbalance: [True]
	predictor__learning_rate: [0.01, 0.05, 0.1]
	predictor__min_split_gain: [0.01, 0.02, 0.05]
	predictor__n_estimators: [5000, 10000]
	predictor__num_leaves: [30, 35]
	predictor__reg_alpha: [0.01, 0.02, 0.05]
	predictor__reg_lambda: [0.01, 0.02, 0.05]
Fitting 5 folds for each of 648 candidates, totalling 3240 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  1.2min
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:  5.5min
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed: 12.6min
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed: 22.3min
[Parallel(n_jobs=-1)]: Done 1242 tasks      | elapsed: 33.9min
[Parallel(n_jobs=-1)]: Done 1792 tasks      | elapsed: 48.4min
[Parallel(n_jobs=-1)]: Done 2442 tasks      | elapsed: 66.6min
[Parallel(n_jobs=-1)]: Done 3192 tasks      | elapsed: 86.2min
[Parallel(n_jobs=-1)]: Done 3240 out of 3240 | elapsed: 87.4min finished
Columns split completed............
Scaling completed............
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[6]	valid's auc: 0.6762	valid's binary_logloss: 0.27905
Columns split completed............
Scaling completed............
Scaling completed............
Scaling completed............
Scaling completed............
Scaling completed............
****************************************


Scaling completed............
----------------------------------------


Scaling completed............
****************************************


Best Parameters:
	predictor__boosting_type: gbdt
	predictor__colsample_bytree: 0.8
	predictor__is_unbalance: True
	predictor__learning_rate: 0.01
	predictor__min_split_gain: 0.01
	predictor__n_estimators: 5000
	predictor__num_leaves: 35
	predictor__reg_alpha: 0.01
	predictor__reg_lambda: 0.01
****** FINISH Best Model Scaled: LightGBM  *****

----------------------------------------
****************************************
----------------------------------------
Model Train Accuracy Score Train AUC Score Test Accuracy Score Test AUC Score Training Time Experiments Description
0 Best Model Scaled:XGBoost 0.92 0.811 0.919 0.686 139.1765 360 [["predictor__colsample_bytree", 0.8], ["predi...
1 Best Model Scaled:LightGBM 0.92 0.990 0.919 0.556 98.6863 3240 [["predictor__boosting_type", "gbdt"], ["predi...
CPU times: user 21min 4s, sys: 39.1 s, total: 21min 43s
Wall time: 3h 33min 22s

PCA of Unbalanced Dataset

In [35]:
%%time

from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier

# This might take a while
if __name__ == "__main__":
    
    PCA_ConductGridSearch(X_train, y_train, X_test, y_test, 0, "Best Model PCA:",  n_jobs=-1,verbose=1)
****** START Best Model PCA: LightGBM *****
Parameters:
	pca__n_components: [0.95, 0.99, 0.999]
	predictor__boosting_type: ['gbdt']
	predictor__colsample_bytree: [0.8]
	predictor__learning_rate: [0.01, 0.05, 0.1]
	predictor__min_split_gain: [0.01, 0.02, 0.05]
	predictor__n_estimators: [5000, 10000]
	predictor__num_leaves: [30, 35]
Fitting 5 folds for each of 108 candidates, totalling 540 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  1.7min
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:  8.1min
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed: 20.7min
[Parallel(n_jobs=-1)]: Done 540 out of 540 | elapsed: 25.8min finished
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[9]	valid's auc: 0.600914	valid's binary_logloss: 0.28045
****************************************


----------------------------------------


****************************************


Best Parameters:
	pca__n_components: 0.95
	predictor__boosting_type: gbdt
	predictor__colsample_bytree: 0.8
	predictor__learning_rate: 0.05
	predictor__min_split_gain: 0.01
	predictor__n_estimators: 5000
	predictor__num_leaves: 30
****** FINISH Best Model PCA: LightGBM  *****

----------------------------------------
****************************************
----------------------------------------
Model Train Accuracy Score Train AUC Score Test Accuracy Score Test AUC Score Training Time Experiments Description
0 Best Model PCA:LightGBM 1.0 1.0 0.92 0.72 118.477 180 [["pca__n_components", 0.95], ["predictor__boo...
CPU times: user 10min 1s, sys: 27.4 s, total: 10min 29s
Wall time: 28min 28s

SMOTE Balanced

In [22]:
X = df_train_reduced.drop('TARGET', axis = 1).values
y = df_train_reduced['TARGET'].values

X_train, X_test, y_train, y_test = train_test_split(X, y , stratify = y, 
                                                    test_size = 0.3, random_state = 42)
In [23]:
%%time
# This might take a while
from imblearn.pipeline import Pipeline as imbpipe
from imblearn.over_sampling import SMOTE
if __name__ == "__main__":
    

    # multiprocessing requires the fork to happen in a __main__ protected
    # block

    # find the best parameters for both the feature extraction and the
    # classifier
    # n_jobs=-1 means that the computation will be dispatched on all the CPUs of the computer.
    #
    # By default, the GridSearchCV uses a 3-fold cross-validation. However, if it 
    #            detects that a classifier is passed, rather than a regressor, it uses a stratified 3-fold.
    SMOTE_ConductGridSearch(X_train, y_train, X_test, y_test, 0, "Best Model SMOTE:",  n_jobs=-1,verbose=1)
    
#     display(explog)
****** START Best Model SMOTE: XGBoost *****
Parameters:
	predictor__colsample_bytree: [0.8]
	predictor__eta: [0.01, 0.05, 0.1]
	predictor__max_depth: [6, 8]
	predictor__min_child_weight: [20, 35]
	predictor__n_estimators: [500, 1000]
	predictor__reg_alpha: [0.01]
	predictor__reg_lambda: [0.05]
	predictor__verbose_eval: [False]
	predictor__verbosity: [0]
	sampling__sampling_strategy: [0.25, 0.5, 'auto']
Fitting 5 folds for each of 72 candidates, totalling 360 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed: 169.9min
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed: 707.4min
[Parallel(n_jobs=-1)]: Done 360 out of 360 | elapsed: 975.6min finished
[0]	validation_0-auc:0.69613
Will train until validation_0-auc hasn't improved in 50 rounds.
[1]	validation_0-auc:0.70493
[2]	validation_0-auc:0.71135
[3]	validation_0-auc:0.71375
[4]	validation_0-auc:0.71547
[5]	validation_0-auc:0.71536
[6]	validation_0-auc:0.71745
[7]	validation_0-auc:0.71662
[8]	validation_0-auc:0.71730
[9]	validation_0-auc:0.71841
[10]	validation_0-auc:0.71886
[11]	validation_0-auc:0.72059
[12]	validation_0-auc:0.72212
[13]	validation_0-auc:0.72452
[14]	validation_0-auc:0.72462
[15]	validation_0-auc:0.72507
[16]	validation_0-auc:0.72478
[17]	validation_0-auc:0.72562
[18]	validation_0-auc:0.72679
[19]	validation_0-auc:0.72681
[20]	validation_0-auc:0.72684
[21]	validation_0-auc:0.72748
[22]	validation_0-auc:0.72820
[23]	validation_0-auc:0.72868
[24]	validation_0-auc:0.72868
[25]	validation_0-auc:0.72868
[26]	validation_0-auc:0.72891
[27]	validation_0-auc:0.72944
[28]	validation_0-auc:0.73030
[29]	validation_0-auc:0.73016
[30]	validation_0-auc:0.73006
[31]	validation_0-auc:0.72989
[32]	validation_0-auc:0.73049
[33]	validation_0-auc:0.73070
[34]	validation_0-auc:0.73073
[35]	validation_0-auc:0.73100
[36]	validation_0-auc:0.73141
[37]	validation_0-auc:0.73171
[38]	validation_0-auc:0.73169
[39]	validation_0-auc:0.73208
[40]	validation_0-auc:0.73193
[41]	validation_0-auc:0.73176
[42]	validation_0-auc:0.73188
[43]	validation_0-auc:0.73184
[44]	validation_0-auc:0.73196
[45]	validation_0-auc:0.73199
[46]	validation_0-auc:0.73196
[47]	validation_0-auc:0.73209
[48]	validation_0-auc:0.73236
[49]	validation_0-auc:0.73288
[50]	validation_0-auc:0.73296
[51]	validation_0-auc:0.73297
[52]	validation_0-auc:0.73322
[53]	validation_0-auc:0.73349
[54]	validation_0-auc:0.73337
[55]	validation_0-auc:0.73333
[56]	validation_0-auc:0.73356
[57]	validation_0-auc:0.73384
[58]	validation_0-auc:0.73404
[59]	validation_0-auc:0.73451
[60]	validation_0-auc:0.73449
[61]	validation_0-auc:0.73454
[62]	validation_0-auc:0.73455
[63]	validation_0-auc:0.73459
[64]	validation_0-auc:0.73480
[65]	validation_0-auc:0.73492
[66]	validation_0-auc:0.73509
[67]	validation_0-auc:0.73507
[68]	validation_0-auc:0.73500
[69]	validation_0-auc:0.73506
[70]	validation_0-auc:0.73505
[71]	validation_0-auc:0.73530
[72]	validation_0-auc:0.73544
[73]	validation_0-auc:0.73550
[74]	validation_0-auc:0.73560
[75]	validation_0-auc:0.73554
[76]	validation_0-auc:0.73569
[77]	validation_0-auc:0.73575
[78]	validation_0-auc:0.73589
[79]	validation_0-auc:0.73597
[80]	validation_0-auc:0.73610
[81]	validation_0-auc:0.73629
[82]	validation_0-auc:0.73630
[83]	validation_0-auc:0.73646
[84]	validation_0-auc:0.73667
[85]	validation_0-auc:0.73667
[86]	validation_0-auc:0.73675
[87]	validation_0-auc:0.73704
[88]	validation_0-auc:0.73704
[89]	validation_0-auc:0.73707
[90]	validation_0-auc:0.73720
[91]	validation_0-auc:0.73725
[92]	validation_0-auc:0.73729
[93]	validation_0-auc:0.73737
[94]	validation_0-auc:0.73745
[95]	validation_0-auc:0.73758
[96]	validation_0-auc:0.73785
[97]	validation_0-auc:0.73791
[98]	validation_0-auc:0.73807
[99]	validation_0-auc:0.73821
[100]	validation_0-auc:0.73837
[101]	validation_0-auc:0.73849
[102]	validation_0-auc:0.73857
[103]	validation_0-auc:0.73868
[104]	validation_0-auc:0.73882
[105]	validation_0-auc:0.73889
[106]	validation_0-auc:0.73897
[107]	validation_0-auc:0.73908
[108]	validation_0-auc:0.73935
[109]	validation_0-auc:0.73949
[110]	validation_0-auc:0.73986
[111]	validation_0-auc:0.73990
[112]	validation_0-auc:0.73993
[113]	validation_0-auc:0.74020
[114]	validation_0-auc:0.74041
[115]	validation_0-auc:0.74049
[116]	validation_0-auc:0.74060
[117]	validation_0-auc:0.74076
[118]	validation_0-auc:0.74096
[119]	validation_0-auc:0.74106
[120]	validation_0-auc:0.74116
[121]	validation_0-auc:0.74126
[122]	validation_0-auc:0.74151
[123]	validation_0-auc:0.74155
[124]	validation_0-auc:0.74155
[125]	validation_0-auc:0.74162
[126]	validation_0-auc:0.74175
[127]	validation_0-auc:0.74203
[128]	validation_0-auc:0.74202
[129]	validation_0-auc:0.74206
[130]	validation_0-auc:0.74208
[131]	validation_0-auc:0.74209
[132]	validation_0-auc:0.74228
[133]	validation_0-auc:0.74230
[134]	validation_0-auc:0.74240
[135]	validation_0-auc:0.74233
[136]	validation_0-auc:0.74241
[137]	validation_0-auc:0.74254
[138]	validation_0-auc:0.74262
[139]	validation_0-auc:0.74254
[140]	validation_0-auc:0.74259
[141]	validation_0-auc:0.74268
[142]	validation_0-auc:0.74264
[143]	validation_0-auc:0.74260
[144]	validation_0-auc:0.74267
[145]	validation_0-auc:0.74266
[146]	validation_0-auc:0.74279
[147]	validation_0-auc:0.74290
[148]	validation_0-auc:0.74295
[149]	validation_0-auc:0.74295
[150]	validation_0-auc:0.74295
[151]	validation_0-auc:0.74301
[152]	validation_0-auc:0.74306
[153]	validation_0-auc:0.74319
[154]	validation_0-auc:0.74336
[155]	validation_0-auc:0.74336
[156]	validation_0-auc:0.74350
[157]	validation_0-auc:0.74359
[158]	validation_0-auc:0.74359
[159]	validation_0-auc:0.74373
[160]	validation_0-auc:0.74381
[161]	validation_0-auc:0.74389
[162]	validation_0-auc:0.74406
[163]	validation_0-auc:0.74426
[164]	validation_0-auc:0.74430
[165]	validation_0-auc:0.74447
[166]	validation_0-auc:0.74448
[167]	validation_0-auc:0.74454
[168]	validation_0-auc:0.74467
[169]	validation_0-auc:0.74471
[170]	validation_0-auc:0.74480
[171]	validation_0-auc:0.74486
[172]	validation_0-auc:0.74488
[173]	validation_0-auc:0.74486
[174]	validation_0-auc:0.74484
[175]	validation_0-auc:0.74494
[176]	validation_0-auc:0.74499
[177]	validation_0-auc:0.74508
[178]	validation_0-auc:0.74530
[179]	validation_0-auc:0.74536
[180]	validation_0-auc:0.74547
[181]	validation_0-auc:0.74565
[182]	validation_0-auc:0.74578
[183]	validation_0-auc:0.74577
[184]	validation_0-auc:0.74584
[185]	validation_0-auc:0.74589
[186]	validation_0-auc:0.74610
[187]	validation_0-auc:0.74617
[188]	validation_0-auc:0.74622
[189]	validation_0-auc:0.74626
[190]	validation_0-auc:0.74625
[191]	validation_0-auc:0.74631
[192]	validation_0-auc:0.74643
[193]	validation_0-auc:0.74653
[194]	validation_0-auc:0.74664
[195]	validation_0-auc:0.74677
[196]	validation_0-auc:0.74676
[197]	validation_0-auc:0.74684
[198]	validation_0-auc:0.74696
[199]	validation_0-auc:0.74694
[200]	validation_0-auc:0.74708
[201]	validation_0-auc:0.74718
[202]	validation_0-auc:0.74719
[203]	validation_0-auc:0.74726
[204]	validation_0-auc:0.74738
[205]	validation_0-auc:0.74752
[206]	validation_0-auc:0.74757
[207]	validation_0-auc:0.74761
[208]	validation_0-auc:0.74766
[209]	validation_0-auc:0.74770
[210]	validation_0-auc:0.74777
[211]	validation_0-auc:0.74789
[212]	validation_0-auc:0.74794
[213]	validation_0-auc:0.74805
[214]	validation_0-auc:0.74811
[215]	validation_0-auc:0.74815
[216]	validation_0-auc:0.74825
[217]	validation_0-auc:0.74833
[218]	validation_0-auc:0.74843
[219]	validation_0-auc:0.74842
[220]	validation_0-auc:0.74854
[221]	validation_0-auc:0.74862
[222]	validation_0-auc:0.74869
[223]	validation_0-auc:0.74873
[224]	validation_0-auc:0.74887
[225]	validation_0-auc:0.74890
[226]	validation_0-auc:0.74893
[227]	validation_0-auc:0.74909
[228]	validation_0-auc:0.74918
[229]	validation_0-auc:0.74929
[230]	validation_0-auc:0.74936
[231]	validation_0-auc:0.74938
[232]	validation_0-auc:0.74946
[233]	validation_0-auc:0.74953
[234]	validation_0-auc:0.74968
[235]	validation_0-auc:0.74970
[236]	validation_0-auc:0.74978
[237]	validation_0-auc:0.74985
[238]	validation_0-auc:0.74986
[239]	validation_0-auc:0.74989
[240]	validation_0-auc:0.75003
[241]	validation_0-auc:0.75012
[242]	validation_0-auc:0.75023
[243]	validation_0-auc:0.75030
[244]	validation_0-auc:0.75049
[245]	validation_0-auc:0.75056
[246]	validation_0-auc:0.75064
[247]	validation_0-auc:0.75069
[248]	validation_0-auc:0.75083
[249]	validation_0-auc:0.75091
[250]	validation_0-auc:0.75096
[251]	validation_0-auc:0.75103
[252]	validation_0-auc:0.75103
[253]	validation_0-auc:0.75110
[254]	validation_0-auc:0.75123
[255]	validation_0-auc:0.75128
[256]	validation_0-auc:0.75131
[257]	validation_0-auc:0.75139
[258]	validation_0-auc:0.75152
[259]	validation_0-auc:0.75158
[260]	validation_0-auc:0.75163
[261]	validation_0-auc:0.75172
[262]	validation_0-auc:0.75173
[263]	validation_0-auc:0.75173
[264]	validation_0-auc:0.75176
[265]	validation_0-auc:0.75175
[266]	validation_0-auc:0.75189
[267]	validation_0-auc:0.75196
[268]	validation_0-auc:0.75204
[269]	validation_0-auc:0.75209
[270]	validation_0-auc:0.75213
[271]	validation_0-auc:0.75224
[272]	validation_0-auc:0.75231
[273]	validation_0-auc:0.75233
[274]	validation_0-auc:0.75241
[275]	validation_0-auc:0.75246
[276]	validation_0-auc:0.75251
[277]	validation_0-auc:0.75258
[278]	validation_0-auc:0.75264
[279]	validation_0-auc:0.75268
[280]	validation_0-auc:0.75275
[281]	validation_0-auc:0.75277
[282]	validation_0-auc:0.75287
[283]	validation_0-auc:0.75291
[284]	validation_0-auc:0.75299
[285]	validation_0-auc:0.75305
[286]	validation_0-auc:0.75312
[287]	validation_0-auc:0.75316
[288]	validation_0-auc:0.75321
[289]	validation_0-auc:0.75325
[290]	validation_0-auc:0.75334
[291]	validation_0-auc:0.75345
[292]	validation_0-auc:0.75355
[293]	validation_0-auc:0.75358
[294]	validation_0-auc:0.75358
[295]	validation_0-auc:0.75375
[296]	validation_0-auc:0.75386
[297]	validation_0-auc:0.75390
[298]	validation_0-auc:0.75397
[299]	validation_0-auc:0.75404
[300]	validation_0-auc:0.75408
[301]	validation_0-auc:0.75418
[302]	validation_0-auc:0.75423
[303]	validation_0-auc:0.75430
[304]	validation_0-auc:0.75431
[305]	validation_0-auc:0.75440
[306]	validation_0-auc:0.75443
[307]	validation_0-auc:0.75452
[308]	validation_0-auc:0.75461
[309]	validation_0-auc:0.75463
[310]	validation_0-auc:0.75469
[311]	validation_0-auc:0.75473
[312]	validation_0-auc:0.75486
[313]	validation_0-auc:0.75492
[314]	validation_0-auc:0.75503
[315]	validation_0-auc:0.75505
[316]	validation_0-auc:0.75512
[317]	validation_0-auc:0.75521
[318]	validation_0-auc:0.75522
[319]	validation_0-auc:0.75532
[320]	validation_0-auc:0.75540
[321]	validation_0-auc:0.75548
[322]	validation_0-auc:0.75551
[323]	validation_0-auc:0.75556
[324]	validation_0-auc:0.75563
[325]	validation_0-auc:0.75572
[326]	validation_0-auc:0.75579
[327]	validation_0-auc:0.75587
[328]	validation_0-auc:0.75594
[329]	validation_0-auc:0.75603
[330]	validation_0-auc:0.75611
[331]	validation_0-auc:0.75617
[332]	validation_0-auc:0.75621
[333]	validation_0-auc:0.75623
[334]	validation_0-auc:0.75630
[335]	validation_0-auc:0.75636
[336]	validation_0-auc:0.75644
[337]	validation_0-auc:0.75650
[338]	validation_0-auc:0.75655
[339]	validation_0-auc:0.75663
[340]	validation_0-auc:0.75670
[341]	validation_0-auc:0.75679
[342]	validation_0-auc:0.75687
[343]	validation_0-auc:0.75692
[344]	validation_0-auc:0.75694
[345]	validation_0-auc:0.75701
[346]	validation_0-auc:0.75711
[347]	validation_0-auc:0.75718
[348]	validation_0-auc:0.75724
[349]	validation_0-auc:0.75735
[350]	validation_0-auc:0.75738
[351]	validation_0-auc:0.75745
[352]	validation_0-auc:0.75753
[353]	validation_0-auc:0.75757
[354]	validation_0-auc:0.75761
[355]	validation_0-auc:0.75771
[356]	validation_0-auc:0.75785
[357]	validation_0-auc:0.75791
[358]	validation_0-auc:0.75793
[359]	validation_0-auc:0.75803
[360]	validation_0-auc:0.75809
[361]	validation_0-auc:0.75813
[362]	validation_0-auc:0.75821
[363]	validation_0-auc:0.75827
[364]	validation_0-auc:0.75829
[365]	validation_0-auc:0.75835
[366]	validation_0-auc:0.75838
[367]	validation_0-auc:0.75847
[368]	validation_0-auc:0.75852
[369]	validation_0-auc:0.75852
[370]	validation_0-auc:0.75857
[371]	validation_0-auc:0.75864
[372]	validation_0-auc:0.75868
[373]	validation_0-auc:0.75877
[374]	validation_0-auc:0.75882
[375]	validation_0-auc:0.75887
[376]	validation_0-auc:0.75889
[377]	validation_0-auc:0.75891
[378]	validation_0-auc:0.75895
[379]	validation_0-auc:0.75905
[380]	validation_0-auc:0.75907
[381]	validation_0-auc:0.75908
[382]	validation_0-auc:0.75910
[383]	validation_0-auc:0.75917
[384]	validation_0-auc:0.75918
[385]	validation_0-auc:0.75924
[386]	validation_0-auc:0.75925
[387]	validation_0-auc:0.75930
[388]	validation_0-auc:0.75935
[389]	validation_0-auc:0.75941
[390]	validation_0-auc:0.75946
[391]	validation_0-auc:0.75954
[392]	validation_0-auc:0.75964
[393]	validation_0-auc:0.75972
[394]	validation_0-auc:0.75975
[395]	validation_0-auc:0.75978
[396]	validation_0-auc:0.75988
[397]	validation_0-auc:0.75995
[398]	validation_0-auc:0.76001
[399]	validation_0-auc:0.76012
[400]	validation_0-auc:0.76011
[401]	validation_0-auc:0.76016
[402]	validation_0-auc:0.76020
[403]	validation_0-auc:0.76022
[404]	validation_0-auc:0.76023
[405]	validation_0-auc:0.76030
[406]	validation_0-auc:0.76038
[407]	validation_0-auc:0.76042
[408]	validation_0-auc:0.76051
[409]	validation_0-auc:0.76051
[410]	validation_0-auc:0.76053
[411]	validation_0-auc:0.76060
[412]	validation_0-auc:0.76067
[413]	validation_0-auc:0.76072
[414]	validation_0-auc:0.76080
[415]	validation_0-auc:0.76086
[416]	validation_0-auc:0.76090
[417]	validation_0-auc:0.76098
[418]	validation_0-auc:0.76103
[419]	validation_0-auc:0.76111
[420]	validation_0-auc:0.76114
[421]	validation_0-auc:0.76115
[422]	validation_0-auc:0.76125
[423]	validation_0-auc:0.76133
[424]	validation_0-auc:0.76135
[425]	validation_0-auc:0.76145
[426]	validation_0-auc:0.76151
[427]	validation_0-auc:0.76152
[428]	validation_0-auc:0.76160
[429]	validation_0-auc:0.76158
[430]	validation_0-auc:0.76158
[431]	validation_0-auc:0.76164
[432]	validation_0-auc:0.76166
[433]	validation_0-auc:0.76171
[434]	validation_0-auc:0.76173
[435]	validation_0-auc:0.76176
[436]	validation_0-auc:0.76184
[437]	validation_0-auc:0.76188
[438]	validation_0-auc:0.76194
[439]	validation_0-auc:0.76202
[440]	validation_0-auc:0.76205
[441]	validation_0-auc:0.76213
[442]	validation_0-auc:0.76216
[443]	validation_0-auc:0.76220
[444]	validation_0-auc:0.76224
[445]	validation_0-auc:0.76230
[446]	validation_0-auc:0.76238
[447]	validation_0-auc:0.76246
[448]	validation_0-auc:0.76251
[449]	validation_0-auc:0.76253
[450]	validation_0-auc:0.76263
[451]	validation_0-auc:0.76267
[452]	validation_0-auc:0.76273
[453]	validation_0-auc:0.76283
[454]	validation_0-auc:0.76295
[455]	validation_0-auc:0.76296
[456]	validation_0-auc:0.76299
[457]	validation_0-auc:0.76299
[458]	validation_0-auc:0.76305
[459]	validation_0-auc:0.76307
[460]	validation_0-auc:0.76315
[461]	validation_0-auc:0.76317
[462]	validation_0-auc:0.76316
[463]	validation_0-auc:0.76316
[464]	validation_0-auc:0.76319
[465]	validation_0-auc:0.76323
[466]	validation_0-auc:0.76327
[467]	validation_0-auc:0.76332
[468]	validation_0-auc:0.76331
[469]	validation_0-auc:0.76336
[470]	validation_0-auc:0.76341
[471]	validation_0-auc:0.76347
[472]	validation_0-auc:0.76349
[473]	validation_0-auc:0.76356
[474]	validation_0-auc:0.76355
[475]	validation_0-auc:0.76355
[476]	validation_0-auc:0.76357
[477]	validation_0-auc:0.76364
[478]	validation_0-auc:0.76369
[479]	validation_0-auc:0.76373
[480]	validation_0-auc:0.76375
[481]	validation_0-auc:0.76384
[482]	validation_0-auc:0.76390
[483]	validation_0-auc:0.76396
[484]	validation_0-auc:0.76399
[485]	validation_0-auc:0.76405
[486]	validation_0-auc:0.76413
[487]	validation_0-auc:0.76416
[488]	validation_0-auc:0.76416
[489]	validation_0-auc:0.76421
[490]	validation_0-auc:0.76428
[491]	validation_0-auc:0.76430
[492]	validation_0-auc:0.76429
[493]	validation_0-auc:0.76432
[494]	validation_0-auc:0.76431
[495]	validation_0-auc:0.76435
[496]	validation_0-auc:0.76438
[497]	validation_0-auc:0.76441
[498]	validation_0-auc:0.76445
[499]	validation_0-auc:0.76452
[500]	validation_0-auc:0.76454
[501]	validation_0-auc:0.76456
[502]	validation_0-auc:0.76458
[503]	validation_0-auc:0.76461
[504]	validation_0-auc:0.76459
[505]	validation_0-auc:0.76460
[506]	validation_0-auc:0.76468
[507]	validation_0-auc:0.76470
[508]	validation_0-auc:0.76473
[509]	validation_0-auc:0.76471
[510]	validation_0-auc:0.76477
[511]	validation_0-auc:0.76480
[512]	validation_0-auc:0.76482
[513]	validation_0-auc:0.76487
[514]	validation_0-auc:0.76489
[515]	validation_0-auc:0.76492
[516]	validation_0-auc:0.76493
[517]	validation_0-auc:0.76496
[518]	validation_0-auc:0.76500
[519]	validation_0-auc:0.76504
[520]	validation_0-auc:0.76508
[521]	validation_0-auc:0.76508
[522]	validation_0-auc:0.76514
[523]	validation_0-auc:0.76517
[524]	validation_0-auc:0.76518
[525]	validation_0-auc:0.76521
[526]	validation_0-auc:0.76523
[527]	validation_0-auc:0.76526
[528]	validation_0-auc:0.76530
[529]	validation_0-auc:0.76530
[530]	validation_0-auc:0.76535
[531]	validation_0-auc:0.76536
[532]	validation_0-auc:0.76546
[533]	validation_0-auc:0.76549
[534]	validation_0-auc:0.76548
[535]	validation_0-auc:0.76553
[536]	validation_0-auc:0.76557
[537]	validation_0-auc:0.76557
[538]	validation_0-auc:0.76567
[539]	validation_0-auc:0.76569
[540]	validation_0-auc:0.76571
[541]	validation_0-auc:0.76573
[542]	validation_0-auc:0.76578
[543]	validation_0-auc:0.76581
[544]	validation_0-auc:0.76584
[545]	validation_0-auc:0.76587
[546]	validation_0-auc:0.76589
[547]	validation_0-auc:0.76592
[548]	validation_0-auc:0.76595
[549]	validation_0-auc:0.76596
[550]	validation_0-auc:0.76597
[551]	validation_0-auc:0.76599
[552]	validation_0-auc:0.76602
[553]	validation_0-auc:0.76604
[554]	validation_0-auc:0.76607
[555]	validation_0-auc:0.76608
[556]	validation_0-auc:0.76612
[557]	validation_0-auc:0.76619
[558]	validation_0-auc:0.76620
[559]	validation_0-auc:0.76624
[560]	validation_0-auc:0.76626
[561]	validation_0-auc:0.76631
[562]	validation_0-auc:0.76635
[563]	validation_0-auc:0.76636
[564]	validation_0-auc:0.76640
[565]	validation_0-auc:0.76641
[566]	validation_0-auc:0.76645
[567]	validation_0-auc:0.76647
[568]	validation_0-auc:0.76648
[569]	validation_0-auc:0.76654
[570]	validation_0-auc:0.76654
[571]	validation_0-auc:0.76655
[572]	validation_0-auc:0.76657
[573]	validation_0-auc:0.76658
[574]	validation_0-auc:0.76657
[575]	validation_0-auc:0.76660
[576]	validation_0-auc:0.76665
[577]	validation_0-auc:0.76666
[578]	validation_0-auc:0.76669
[579]	validation_0-auc:0.76669
[580]	validation_0-auc:0.76677
[581]	validation_0-auc:0.76680
[582]	validation_0-auc:0.76682
[583]	validation_0-auc:0.76683
[584]	validation_0-auc:0.76684
[585]	validation_0-auc:0.76686
[586]	validation_0-auc:0.76690
[587]	validation_0-auc:0.76692
[588]	validation_0-auc:0.76691
[589]	validation_0-auc:0.76693
[590]	validation_0-auc:0.76694
[591]	validation_0-auc:0.76696
[592]	validation_0-auc:0.76697
[593]	validation_0-auc:0.76698
[594]	validation_0-auc:0.76700
[595]	validation_0-auc:0.76704
[596]	validation_0-auc:0.76709
[597]	validation_0-auc:0.76708
[598]	validation_0-auc:0.76709
[599]	validation_0-auc:0.76709
[600]	validation_0-auc:0.76709
[601]	validation_0-auc:0.76709
[602]	validation_0-auc:0.76710
[603]	validation_0-auc:0.76714
[604]	validation_0-auc:0.76717
[605]	validation_0-auc:0.76719
[606]	validation_0-auc:0.76723
[607]	validation_0-auc:0.76725
[608]	validation_0-auc:0.76727
[609]	validation_0-auc:0.76728
[610]	validation_0-auc:0.76728
[611]	validation_0-auc:0.76732
[612]	validation_0-auc:0.76734
[613]	validation_0-auc:0.76733
[614]	validation_0-auc:0.76731
[615]	validation_0-auc:0.76732
[616]	validation_0-auc:0.76737
[617]	validation_0-auc:0.76736
[618]	validation_0-auc:0.76738
[619]	validation_0-auc:0.76740
[620]	validation_0-auc:0.76744
[621]	validation_0-auc:0.76746
[622]	validation_0-auc:0.76750
[623]	validation_0-auc:0.76753
[624]	validation_0-auc:0.76756
[625]	validation_0-auc:0.76758
[626]	validation_0-auc:0.76760
[627]	validation_0-auc:0.76762
[628]	validation_0-auc:0.76761
[629]	validation_0-auc:0.76764
[630]	validation_0-auc:0.76768
[631]	validation_0-auc:0.76768
[632]	validation_0-auc:0.76772
[633]	validation_0-auc:0.76773
[634]	validation_0-auc:0.76780
[635]	validation_0-auc:0.76785
[636]	validation_0-auc:0.76784
[637]	validation_0-auc:0.76784
[638]	validation_0-auc:0.76789
[639]	validation_0-auc:0.76794
[640]	validation_0-auc:0.76799
[641]	validation_0-auc:0.76801
[642]	validation_0-auc:0.76803
[643]	validation_0-auc:0.76803
[644]	validation_0-auc:0.76807
[645]	validation_0-auc:0.76809
[646]	validation_0-auc:0.76812
[647]	validation_0-auc:0.76814
[648]	validation_0-auc:0.76816
[649]	validation_0-auc:0.76819
[650]	validation_0-auc:0.76819
[651]	validation_0-auc:0.76821
[652]	validation_0-auc:0.76826
[653]	validation_0-auc:0.76828
[654]	validation_0-auc:0.76831
[655]	validation_0-auc:0.76832
[656]	validation_0-auc:0.76834
[657]	validation_0-auc:0.76833
[658]	validation_0-auc:0.76834
[659]	validation_0-auc:0.76837
[660]	validation_0-auc:0.76844
[661]	validation_0-auc:0.76850
[662]	validation_0-auc:0.76850
[663]	validation_0-auc:0.76851
[664]	validation_0-auc:0.76850
[665]	validation_0-auc:0.76849
[666]	validation_0-auc:0.76847
[667]	validation_0-auc:0.76848
[668]	validation_0-auc:0.76852
[669]	validation_0-auc:0.76855
[670]	validation_0-auc:0.76855
[671]	validation_0-auc:0.76856
[672]	validation_0-auc:0.76858
[673]	validation_0-auc:0.76861
[674]	validation_0-auc:0.76861
[675]	validation_0-auc:0.76863
[676]	validation_0-auc:0.76864
[677]	validation_0-auc:0.76864
[678]	validation_0-auc:0.76866
[679]	validation_0-auc:0.76866
[680]	validation_0-auc:0.76869
[681]	validation_0-auc:0.76868
[682]	validation_0-auc:0.76873
[683]	validation_0-auc:0.76873
[684]	validation_0-auc:0.76879
[685]	validation_0-auc:0.76881
[686]	validation_0-auc:0.76881
[687]	validation_0-auc:0.76885
[688]	validation_0-auc:0.76885
[689]	validation_0-auc:0.76888
[690]	validation_0-auc:0.76891
[691]	validation_0-auc:0.76891
[692]	validation_0-auc:0.76893
[693]	validation_0-auc:0.76891
[694]	validation_0-auc:0.76894
[695]	validation_0-auc:0.76895
[696]	validation_0-auc:0.76894
[697]	validation_0-auc:0.76895
[698]	validation_0-auc:0.76896
[699]	validation_0-auc:0.76900
[700]	validation_0-auc:0.76899
[701]	validation_0-auc:0.76901
[702]	validation_0-auc:0.76902
[703]	validation_0-auc:0.76902
[704]	validation_0-auc:0.76902
[705]	validation_0-auc:0.76902
[706]	validation_0-auc:0.76901
[707]	validation_0-auc:0.76905
[708]	validation_0-auc:0.76909
[709]	validation_0-auc:0.76914
[710]	validation_0-auc:0.76914
[711]	validation_0-auc:0.76917
[712]	validation_0-auc:0.76917
[713]	validation_0-auc:0.76915
[714]	validation_0-auc:0.76917
[715]	validation_0-auc:0.76921
[716]	validation_0-auc:0.76924
[717]	validation_0-auc:0.76925
[718]	validation_0-auc:0.76925
[719]	validation_0-auc:0.76926
[720]	validation_0-auc:0.76923
[721]	validation_0-auc:0.76924
[722]	validation_0-auc:0.76927
[723]	validation_0-auc:0.76929
[724]	validation_0-auc:0.76929
[725]	validation_0-auc:0.76933
[726]	validation_0-auc:0.76932
[727]	validation_0-auc:0.76931
[728]	validation_0-auc:0.76930
[729]	validation_0-auc:0.76929
[730]	validation_0-auc:0.76933
[731]	validation_0-auc:0.76931
[732]	validation_0-auc:0.76932
[733]	validation_0-auc:0.76933
[734]	validation_0-auc:0.76935
[735]	validation_0-auc:0.76935
[736]	validation_0-auc:0.76941
[737]	validation_0-auc:0.76944
[738]	validation_0-auc:0.76946
[739]	validation_0-auc:0.76944
[740]	validation_0-auc:0.76947
[741]	validation_0-auc:0.76946
[742]	validation_0-auc:0.76947
[743]	validation_0-auc:0.76948
[744]	validation_0-auc:0.76950
[745]	validation_0-auc:0.76950
[746]	validation_0-auc:0.76949
[747]	validation_0-auc:0.76948
[748]	validation_0-auc:0.76948
[749]	validation_0-auc:0.76949
[750]	validation_0-auc:0.76950
[751]	validation_0-auc:0.76949
[752]	validation_0-auc:0.76949
[753]	validation_0-auc:0.76951
[754]	validation_0-auc:0.76953
[755]	validation_0-auc:0.76955
[756]	validation_0-auc:0.76956
[757]	validation_0-auc:0.76956
[758]	validation_0-auc:0.76958
[759]	validation_0-auc:0.76959
[760]	validation_0-auc:0.76960
[761]	validation_0-auc:0.76959
[762]	validation_0-auc:0.76958
[763]	validation_0-auc:0.76960
[764]	validation_0-auc:0.76965
[765]	validation_0-auc:0.76965
[766]	validation_0-auc:0.76967
[767]	validation_0-auc:0.76968
[768]	validation_0-auc:0.76969
[769]	validation_0-auc:0.76970
[770]	validation_0-auc:0.76977
[771]	validation_0-auc:0.76980
[772]	validation_0-auc:0.76980
[773]	validation_0-auc:0.76981
[774]	validation_0-auc:0.76982
[775]	validation_0-auc:0.76984
[776]	validation_0-auc:0.76990
[777]	validation_0-auc:0.76988
[778]	validation_0-auc:0.76989
[779]	validation_0-auc:0.76987
[780]	validation_0-auc:0.76987
[781]	validation_0-auc:0.76987
[782]	validation_0-auc:0.76989
[783]	validation_0-auc:0.76988
[784]	validation_0-auc:0.76987
[785]	validation_0-auc:0.76991
[786]	validation_0-auc:0.76991
[787]	validation_0-auc:0.76993
[788]	validation_0-auc:0.76994
[789]	validation_0-auc:0.76997
[790]	validation_0-auc:0.76998
[791]	validation_0-auc:0.76999
[792]	validation_0-auc:0.77000
[793]	validation_0-auc:0.77002
[794]	validation_0-auc:0.77001
[795]	validation_0-auc:0.76999
[796]	validation_0-auc:0.76999
[797]	validation_0-auc:0.77000
[798]	validation_0-auc:0.77000
[799]	validation_0-auc:0.77002
[800]	validation_0-auc:0.77003
[801]	validation_0-auc:0.77007
[802]	validation_0-auc:0.77008
[803]	validation_0-auc:0.77009
[804]	validation_0-auc:0.77010
[805]	validation_0-auc:0.77010
[806]	validation_0-auc:0.77011
[807]	validation_0-auc:0.77012
[808]	validation_0-auc:0.77012
[809]	validation_0-auc:0.77017
[810]	validation_0-auc:0.77017
[811]	validation_0-auc:0.77021
[812]	validation_0-auc:0.77021
[813]	validation_0-auc:0.77021
[814]	validation_0-auc:0.77022
[815]	validation_0-auc:0.77021
[816]	validation_0-auc:0.77022
[817]	validation_0-auc:0.77022
[818]	validation_0-auc:0.77022
[819]	validation_0-auc:0.77023
[820]	validation_0-auc:0.77023
[821]	validation_0-auc:0.77023
[822]	validation_0-auc:0.77024
[823]	validation_0-auc:0.77024
[824]	validation_0-auc:0.77026
[825]	validation_0-auc:0.77028
[826]	validation_0-auc:0.77031
[827]	validation_0-auc:0.77032
[828]	validation_0-auc:0.77033
[829]	validation_0-auc:0.77035
[830]	validation_0-auc:0.77034
[831]	validation_0-auc:0.77033
[832]	validation_0-auc:0.77034
[833]	validation_0-auc:0.77038
[834]	validation_0-auc:0.77040
[835]	validation_0-auc:0.77041
[836]	validation_0-auc:0.77040
[837]	validation_0-auc:0.77037
[838]	validation_0-auc:0.77038
[839]	validation_0-auc:0.77038
[840]	validation_0-auc:0.77040
[841]	validation_0-auc:0.77041
[842]	validation_0-auc:0.77041
[843]	validation_0-auc:0.77044
[844]	validation_0-auc:0.77043
[845]	validation_0-auc:0.77039
[846]	validation_0-auc:0.77040
[847]	validation_0-auc:0.77040
[848]	validation_0-auc:0.77038
[849]	validation_0-auc:0.77041
[850]	validation_0-auc:0.77041
[851]	validation_0-auc:0.77041
[852]	validation_0-auc:0.77041
[853]	validation_0-auc:0.77041
[854]	validation_0-auc:0.77043
[855]	validation_0-auc:0.77042
[856]	validation_0-auc:0.77042
[857]	validation_0-auc:0.77041
[858]	validation_0-auc:0.77044
[859]	validation_0-auc:0.77046
[860]	validation_0-auc:0.77044
[861]	validation_0-auc:0.77044
[862]	validation_0-auc:0.77043
[863]	validation_0-auc:0.77044
[864]	validation_0-auc:0.77041
[865]	validation_0-auc:0.77040
[866]	validation_0-auc:0.77041
[867]	validation_0-auc:0.77043
[868]	validation_0-auc:0.77046
[869]	validation_0-auc:0.77045
[870]	validation_0-auc:0.77043
[871]	validation_0-auc:0.77046
[872]	validation_0-auc:0.77047
[873]	validation_0-auc:0.77047
[874]	validation_0-auc:0.77047
[875]	validation_0-auc:0.77047
[876]	validation_0-auc:0.77046
[877]	validation_0-auc:0.77047
[878]	validation_0-auc:0.77047
[879]	validation_0-auc:0.77047
[880]	validation_0-auc:0.77049
[881]	validation_0-auc:0.77049
[882]	validation_0-auc:0.77048
[883]	validation_0-auc:0.77047
[884]	validation_0-auc:0.77049
[885]	validation_0-auc:0.77048
[886]	validation_0-auc:0.77049
[887]	validation_0-auc:0.77048
[888]	validation_0-auc:0.77050
[889]	validation_0-auc:0.77050
[890]	validation_0-auc:0.77051
[891]	validation_0-auc:0.77051
[892]	validation_0-auc:0.77055
[893]	validation_0-auc:0.77055
[894]	validation_0-auc:0.77056
[895]	validation_0-auc:0.77056
[896]	validation_0-auc:0.77055
[897]	validation_0-auc:0.77056
[898]	validation_0-auc:0.77056
[899]	validation_0-auc:0.77057
[900]	validation_0-auc:0.77058
[901]	validation_0-auc:0.77054
[902]	validation_0-auc:0.77054
[903]	validation_0-auc:0.77056
[904]	validation_0-auc:0.77056
[905]	validation_0-auc:0.77057
[906]	validation_0-auc:0.77056
[907]	validation_0-auc:0.77056
[908]	validation_0-auc:0.77055
[909]	validation_0-auc:0.77055
[910]	validation_0-auc:0.77055
[911]	validation_0-auc:0.77055
[912]	validation_0-auc:0.77055
[913]	validation_0-auc:0.77055
[914]	validation_0-auc:0.77058
[915]	validation_0-auc:0.77057
[916]	validation_0-auc:0.77057
[917]	validation_0-auc:0.77057
[918]	validation_0-auc:0.77056
[919]	validation_0-auc:0.77057
[920]	validation_0-auc:0.77056
[921]	validation_0-auc:0.77055
[922]	validation_0-auc:0.77055
[923]	validation_0-auc:0.77057
[924]	validation_0-auc:0.77059
[925]	validation_0-auc:0.77059
[926]	validation_0-auc:0.77059
[927]	validation_0-auc:0.77062
[928]	validation_0-auc:0.77061
[929]	validation_0-auc:0.77064
[930]	validation_0-auc:0.77064
[931]	validation_0-auc:0.77063
[932]	validation_0-auc:0.77064
[933]	validation_0-auc:0.77063
[934]	validation_0-auc:0.77064
[935]	validation_0-auc:0.77065
[936]	validation_0-auc:0.77066
[937]	validation_0-auc:0.77067
[938]	validation_0-auc:0.77068
[939]	validation_0-auc:0.77068
[940]	validation_0-auc:0.77072
[941]	validation_0-auc:0.77073
[942]	validation_0-auc:0.77072
[943]	validation_0-auc:0.77070
[944]	validation_0-auc:0.77072
[945]	validation_0-auc:0.77073
[946]	validation_0-auc:0.77075
[947]	validation_0-auc:0.77074
[948]	validation_0-auc:0.77077
[949]	validation_0-auc:0.77077
[950]	validation_0-auc:0.77075
[951]	validation_0-auc:0.77075
[952]	validation_0-auc:0.77074
[953]	validation_0-auc:0.77074
[954]	validation_0-auc:0.77076
[955]	validation_0-auc:0.77074
[956]	validation_0-auc:0.77078
[957]	validation_0-auc:0.77076
[958]	validation_0-auc:0.77077
[959]	validation_0-auc:0.77078
[960]	validation_0-auc:0.77077
[961]	validation_0-auc:0.77078
[962]	validation_0-auc:0.77080
[963]	validation_0-auc:0.77080
[964]	validation_0-auc:0.77083
[965]	validation_0-auc:0.77081
[966]	validation_0-auc:0.77084
[967]	validation_0-auc:0.77082
[968]	validation_0-auc:0.77082
[969]	validation_0-auc:0.77081
[970]	validation_0-auc:0.77081
[971]	validation_0-auc:0.77081
[972]	validation_0-auc:0.77082
[973]	validation_0-auc:0.77082
[974]	validation_0-auc:0.77084
[975]	validation_0-auc:0.77082
[976]	validation_0-auc:0.77083
[977]	validation_0-auc:0.77084
[978]	validation_0-auc:0.77084
[979]	validation_0-auc:0.77085
[980]	validation_0-auc:0.77085
[981]	validation_0-auc:0.77086
[982]	validation_0-auc:0.77088
[983]	validation_0-auc:0.77088
[984]	validation_0-auc:0.77089
[985]	validation_0-auc:0.77088
[986]	validation_0-auc:0.77090
[987]	validation_0-auc:0.77093
[988]	validation_0-auc:0.77092
[989]	validation_0-auc:0.77094
[990]	validation_0-auc:0.77093
[991]	validation_0-auc:0.77093
[992]	validation_0-auc:0.77095
[993]	validation_0-auc:0.77097
[994]	validation_0-auc:0.77099
[995]	validation_0-auc:0.77100
[996]	validation_0-auc:0.77099
[997]	validation_0-auc:0.77102
[998]	validation_0-auc:0.77101
[999]	validation_0-auc:0.77099
****************************************


----------------------------------------


****************************************


Best Parameters:
	predictor__colsample_bytree: 0.8
	predictor__eta: 0.01
	predictor__max_depth: 8
	predictor__min_child_weight: 20
	predictor__n_estimators: 1000
	predictor__reg_alpha: 0.01
	predictor__reg_lambda: 0.05
	predictor__verbose_eval: False
	predictor__verbosity: 0
	sampling__sampling_strategy: 0.25
****** FINISH Best Model SMOTE: XGBoost  *****

----------------------------------------
****************************************
----------------------------------------
****** START Best Model SMOTE: LightGBM *****
Parameters:
	predictor__boosting_type: ['gbdt']
	predictor__colsample_bytree: [0.8, 0.95]
	predictor__is_unbalance: [True]
	predictor__learning_rate: [0.01, 0.05, 0.1]
	predictor__min_split_gain: [0.01, 0.02, 0.05]
	predictor__n_estimators: [5000, 10000]
	predictor__num_leaves: [30, 35]
	predictor__reg_alpha: [0.01, 0.02, 0.05]
	predictor__reg_lambda: [0.01, 0.02, 0.05]
	sampling__sampling_strategy: [0.25, 0.5, 'auto']
Fitting 5 folds for each of 1944 candidates, totalling 9720 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed: 17.0min
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed: 76.1min
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed: 169.3min
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed: 306.4min
[Parallel(n_jobs=-1)]: Done 1242 tasks      | elapsed: 492.1min
[Parallel(n_jobs=-1)]: Done 1792 tasks      | elapsed: 681.8min
[Parallel(n_jobs=-1)]: Done 2442 tasks      | elapsed: 756.4min
[Parallel(n_jobs=-1)]: Done 3192 tasks      | elapsed: 840.4min
[Parallel(n_jobs=-1)]: Done 4042 tasks      | elapsed: 905.2min
[Parallel(n_jobs=-1)]: Done 4992 tasks      | elapsed: 1017.9min
[Parallel(n_jobs=-1)]: Done 6042 tasks      | elapsed: 1438.0min
[Parallel(n_jobs=-1)]: Done 7192 tasks      | elapsed: 1670.1min
[Parallel(n_jobs=-1)]: Done 8442 tasks      | elapsed: 1784.2min
[Parallel(n_jobs=-1)]: Done 9720 out of 9720 | elapsed: 1874.2min finished
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[1763]	valid's auc: 0.770978	valid's binary_logloss: 0.242476
****************************************


----------------------------------------


****************************************


Best Parameters:
	predictor__boosting_type: gbdt
	predictor__colsample_bytree: 0.8
	predictor__is_unbalance: True
	predictor__learning_rate: 0.01
	predictor__min_split_gain: 0.01
	predictor__n_estimators: 5000
	predictor__num_leaves: 35
	predictor__reg_alpha: 0.01
	predictor__reg_lambda: 0.01
	sampling__sampling_strategy: auto
****** FINISH Best Model SMOTE: LightGBM  *****

----------------------------------------
****************************************
----------------------------------------
Model Train Accuracy Score Train AUC Score Test Accuracy Score Test AUC Score Training Time Experiments Description
0 Best Model SMOTE:XGBoost 0.925 0.892 0.92 0.778 400.2210 120 [["predictor__colsample_bytree", 0.8], ["predi...
1 Best Model SMOTE:LightGBM 0.953 0.991 0.92 0.775 138.1902 3240 [["predictor__boosting_type", "gbdt"], ["predi...
CPU times: user 1h 10min 8s, sys: 1min 32s, total: 1h 11min 41s
Wall time: 1d 23h 47min 50s

Discussion of Results

There are total more than 18000 experiment performed. Best result comes out with synthetically balanced dataset and LightGBM model with Kaggle score of 0.794. Training time for best model is 138 seconds.

Second best model is developed using unbalanced dataset with Kaggle score of 0.792. Training time for second best model is 233 second. It is also obeserved that scaling and PCA has some adverse effect on prediction of results.

Lowest performing model is with PCA where Kaggle score is just above 0.6.

It is recommended to use unbalanced dataset with XGBoost model as it also represent a real scenario.

Best performing features are

  • Term of Loan (Engineered Feature)

  • Mean of External Sources (Engineered Feature)

  • Length of Employment (Given)

  • External Source Score (Given)

  • Future monthly installment ratio (Engineered)

  • Anuuity Amount to Income Ratio (Engineered)

Several features that looks important at first sight were given zero importance by model in order to predict outcome.

Some of the features that were given zero importance by models are

  • Industry where prople work

  • Requirement of certain document

  • Whether person is unemployed or student, as long as he has income

  • Whether person is first time applicant or not, it does not matter.

In [3]:
explog
Out[3]:
Model Train Accuracy Score Train AUC Score Test Accuracy Score Test AUC Score Training Time Experiments Description
0 Best Model Unbalanced:XGBoost 0.922 0.823 0.920 0.779 233.2998 360 [["predictor__colsample_bytree", 0.8], ["predi...
1 Best Model Unbalanced:LightGBM 0.922 0.990 0.842 0.760 125.9134 3240 [["predictor__boosting_type", "gbdt"], ["predi...
2 Best Model Manual Balanced 0.25:XGBoost 0.840 0.893 0.781 0.779 75.7185 360 [["predictor__colsample_bytree", 0.8], ["predi...
3 Best Model Manual Balanced 0.25:LightGBM 0.972 0.998 0.756 0.771 32.2073 3240 [["predictor__boosting_type", "gbdt"], ["predi...
4 Best Model Manual Balanced 50%:XGBoost 0.774 0.825 0.736 0.770 35.3433 360 [["predictor__colsample_bytree", 0.95], ["pred...
5 Best Model Manual Balanced 50%:LightGBM 0.986 0.999 0.727 0.765 56.6256 3240 [["predictor__boosting_type", "gbdt"], ["predi...
6 Best Model Manual Balanced 1:1XGBoost 0.747 0.830 0.702 0.772 10.8852 360 [["predictor__colsample_bytree", 0.8], ["predi...
7 Best Model Manual Balanced 1:1LightGBM 0.994 1.000 0.699 0.770 42.8156 3240 [["predictor__boosting_type", "gbdt"], ["predi...
8 Best Model Scaled:XGBoost 0.920 0.811 0.919 0.686 139.1765 360 [["predictor__colsample_bytree", 0.8], ["predi...
9 Best Model Scaled:LightGBM 0.920 0.990 0.919 0.556 98.6863 3240 [["predictor__boosting_type", "gbdt"], ["predi...
10 Best Model PCA:LightGBM 1.000 1.000 0.920 0.720 118.4770 180 [["pca__n_components", 0.95], ["predictor__boo...
11 Best Model SMOTE:XGBoost 0.925 0.892 0.920 0.778 400.2210 360 [["predictor__colsample_bytree", 0.8], ["predi...
12 Best Model SMOTE:LightGBM 0.953 0.991 0.920 0.775 138.1902 9720 [["predictor__boosting_type", "gbdt"], ["predi...

Kaggle Scores

All above model and their results are submitted to Kaggle. Best result on Kaggle is coming from LightGBM model using Synthetically balanced dataset. Below is snapshot of all submission on Kaggle.

image.png

Conclusion

At the end of Phase2, more than 25000 experiments performed in order to evaluate best model. However, SMOTE synthetically balanced dataset generates best Kaggle score of 0.794, second best model with Unbalanced dataset is recommended. Second best model represents real world scenario where clearly one class is in majority.

Further, this results can be further evaluated using Neural Network techniques and compared their result with traditional classifiers. Next phase will be more focused on developing Neural Network model and compared its result with traditional classfier resutls.

In [ ]: